Engineering Manager, Fleet Reliability
The Engineering Manager, Fleet Reliability at fal will lead the team that keeps the company's GPU nodes running reliably. The role is critical because fal's fleet is poised for significant expansion, and it needs a leader who will establish and oversee the operating model, develop the playbooks, and set a high bar for the team's performance.
Key responsibilities include building and leading the Fleet Reliability team: hiring, developing, and retaining top talent. The manager will own 24/7 coverage for node provisioning, validation, and triage, and will drive the automation roadmap, with a focus on event-driven remediation, self-healing systems, and observability. They will also define and enforce service level agreements (SLAs) so that production GPUs consistently serve traffic, and will set the team's culture around performance metrics, communication, and growth.
The ideal candidate will have over seven years of experience in infrastructure, software, or site reliability engineering (SRE), including at least two years in a leadership role. Experience running a fleet reliability or hardware operations team in a production environment is essential. The candidate should have a proven track record of implementing SRE fundamentals from scratch, such as incident management, postmortems, observability, and change management, and a strong inclination towards automating repetitive tasks to reduce manual toil.
The fleet is expected to grow tenfold, so the position brings substantial challenges and room for professional growth. The company values a process-oriented approach without unnecessary bureaucracy and encourages hands-on leadership, in which leaders carry the pager themselves before asking their team to do so.