AI Platform Architect
Graphcore is seeking an AI Platform Architect to design and oversee the full infrastructure stack that powers our most demanding distributed AI workloads. This role acts as the unifying technical authority across hardware, software, compute, network, and storage, ensuring a cohesive rack-scale AI platform optimized for trillion-parameter LLM training and high-throughput inference. By orchestrating advanced clustering and distributed training frameworks down to the physical layer, you will provide our AI research and deployment teams with a reliable, high-performance platform.
Key responsibilities include defining the holistic architecture for highly clustered AI environments and ensuring bottleneck-free data flow between parallel storage systems, AI compute nodes, and ultra-high-bandwidth network fabrics. You will influence the strategy for AI workload scheduling and orchestration, using tools such as Kubernetes or Slurm to manage distributed training jobs, model checkpointing, and inference serving at massive scale. Additionally, you will profile and eliminate system-level bottlenecks across the entire AI pipeline, tuning everything from deep learning frameworks down to OS-level configurations.
The ideal candidate will have demonstrated experience in systems engineering, cloud architecture, or HPC, with at least 4 years working as a Lead or Principal Architect for large-scale AI or machine learning platforms. Deep practical knowledge of how large models are trained and deployed, including data/tensor/pipeline parallelism and the infrastructure requirements of modern LLM architectures, is essential. An authoritative understanding of system-level bottlenecks and data pathways, including familiarity with PCIe Gen 5/6, NVMe namespaces, and RDMA integration, is also required.
Graphcore offers a dynamic work environment with flexible working hours, comprehensive benefits, and opportunities for personal development. As a wholly owned subsidiary of SoftBank Group, Graphcore is part of an elite family of companies responsible for some of the world’s most transformative technologies, providing a unique opportunity to contribute to the future of AI computing.