Observability, Staff Infrastructure Engineer
As a Staff Infrastructure Engineer specializing in observability at Graphcore, you will play a pivotal role in designing, implementing, and deploying scalable management and monitoring solutions for our next-generation AI infrastructure. Collaborating with software, cloud, and customer-facing teams, you will develop proofs of concept, reference designs, and integrations with third-party tools to enhance our data center operations.
Your primary responsibilities will include contributing to all phases of product development, from definition and architecture to implementation and early customer support. You will design and implement fault-remediation solutions at scale, integrate multi-component systems for seamless management and monitoring, and create comprehensive documentation and reference designs. Additionally, you will deploy internal solutions to support engineering efforts and maintain and improve deployed infrastructure to provide optimal service to our customers.
The ideal candidate will possess a BSc or MSc degree in Computer Engineering, Computer Science, or a related field, or equivalent experience. You should have a proven track record in architecting and implementing scalable, reliable cluster management systems, with a particular focus on telemetry collection and analysis. Experience managing large-scale data centers with an emphasis on hardware observability solutions is essential. Proficiency in maintaining and scaling modern observability stacks using tools such as Prometheus, Grafana, OpenTelemetry, ClickHouse, Kafka, Superset, or Elastic Stack is required. Strong programming skills in C, C++, Go, or Python, along with excellent written and verbal communication abilities, are also necessary.
Graphcore offers a competitive salary, an annual leave policy, medical and dental health plans, a gym card, and an employee pension matched up to 4%. We are committed to building an inclusive work environment and offer a flexible interview process that accommodates reasonable adjustments.