Job Description
Are you ready to push the boundaries of what is computationally possible? The Agency for Science, Technology and Research (A*STAR) is seeking a highly skilled HPC AI Engineer to join our elite team at the National Supercomputing Centre (NSCC). In this role, you will be at the forefront of the Frontier initiative, working directly with some of the most powerful supercomputing infrastructure in the world.
We are looking for a visionary engineer who thrives on complexity and is passionate about accelerating AI research. You will bridge the gap between high-performance computing (HPC) architectures and cutting-edge artificial intelligence, optimising massive neural networks and distributed training workloads to achieve unprecedented performance. If you are driven by national-scale impact and want to work with petascale systems to solve real-world problems, this is the environment for you.
Responsibilities
- Architect and deploy highly scalable AI/ML pipelines on distributed HPC clusters.
- Optimise deep learning frameworks (PyTorch, TensorFlow, JAX) for multi-node, GPU-accelerated environments.
- Collaborate with research scientists to profile, debug, and improve the performance of large-scale AI workloads.
- Implement efficient data-parallel and model-parallel training strategies for large models and massive datasets.
- Maintain and improve the software stacks, containers, and scheduling and orchestration tools (Slurm, Kubernetes) used by the research community.
- Monitor and troubleshoot supercomputing cluster performance to ensure maximum uptime and compute efficiency.
- Stay abreast of industry trends in hardware acceleration, interconnects, and AI accelerator architecture.
Qualifications
- Master’s or PhD in Computer Science, Computational Engineering, or a related field.
- Extensive experience with high-performance computing (HPC) and parallel programming (MPI, OpenMP, CUDA).
- Proven expertise in training and deploying deep learning models on large GPU clusters.
- Strong proficiency in C/C++ and Python.
- In-depth knowledge of containerisation technologies (Docker, Apptainer/Singularity) and workload scheduling (Slurm).
- Familiarity with cloud-native AI workflows and data-intensive computing.
- Excellent analytical skills and the ability to solve complex technical problems in a multidisciplinary environment.