Job Description
Unlock the Power of Supercomputing for AI Innovation
The Agency for Science, Technology and Research (A*STAR) is looking for a passionate and skilled HPC AI Engineer to join our high-performance computing (HPC) team at the National Supercomputing Centre (NSCC). In this role, you will be at the forefront of the digital revolution, enabling cutting-edge AI research that directly impacts healthcare, environmental science, and advanced manufacturing.
As part of the Frontier project, you will bridge the gap between complex AI algorithms and massive-scale compute infrastructure. This is an exceptional opportunity for technical professionals who want to work with world-class supercomputing resources to solve real-world problems. You will collaborate with elite researchers and industry partners to optimize machine learning models, manage massive datasets, and push the boundaries of what is possible in artificial intelligence.
The ideal candidate is someone who thrives in a collaborative, research-driven environment and possesses a deep understanding of both the software and hardware aspects of AI scaling. Join us in Singapore's vibrant innovation ecosystem and contribute to the next generation of scientific breakthroughs.
Responsibilities
- Architect and optimize large-scale AI training pipelines on HPC clusters to ensure maximum resource utilization.
- Collaborate with interdisciplinary research teams to port and scale deep learning models (NLP, computer vision, etc.) across multi-node GPU systems (a minimal sketch of such a setup follows this list).
- Develop and maintain containerized environments (Singularity, Docker) for reproducible AI research.
- Implement efficient data management and I/O strategies for handling massive scientific datasets.
- Profile AI workloads and analyze bottlenecks using tools such as NVIDIA Nsight or the PyTorch Profiler.
- Stay abreast of emerging technologies in distributed computing, AI frameworks, and hardware acceleration.
- Contribute to technical documentation and knowledge sharing within the NSCC ecosystem.
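To give a sense of the multi-node scaling work above, here is a minimal sketch of distributed training setup in PyTorch under Slurm. Everything in it is illustrative rather than NSCC-specific: it assumes one srun task per GPU, and that the batch script exports MASTER_ADDR and MASTER_PORT for the NCCL rendezvous.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed() -> int:
    # Slurm exports these per-task variables when each GPU gets its own
    # srun task; MASTER_ADDR and MASTER_PORT are assumed to be exported
    # by the batch script (e.g. the first allocated node's hostname).
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])

    # NCCL is the standard backend for multi-node, multi-GPU training.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return local_rank

if __name__ == "__main__":
    local_rank = setup_distributed()
    # Stand-in for a real model; DDP synchronizes gradients across ranks.
    model = torch.nn.Linear(512, 512).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    # ... training loop elided ...
    dist.destroy_process_group()
```

The same rank and world-size bookkeeping underlies higher-level launchers such as torchrun; Slurm's per-task environment variables map onto it directly.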
Qualifications
- Master’s or PhD in Computer Science, Computational Science, or a related technical field.
- Proven experience working in High-Performance Computing (HPC) environments using schedulers like Slurm or PBS.
- Expertise in Python and deep learning frameworks such as PyTorch, TensorFlow, or JAX.
- Strong knowledge of parallel computing principles, including MPI, OpenMP, and NCCL.
- Hands-on experience with GPU acceleration (CUDA) and performance tuning for multi-GPU training (a short profiling sketch follows this list).
- Familiarity with Linux system administration and shell scripting.
- Excellent communication skills and the ability to work effectively in a diverse, research-oriented team.
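As an illustration of the profiling and tuning experience above, here is a minimal sketch of bottleneck analysis with the PyTorch Profiler; the toy model and input sizes are placeholders, and a CUDA-capable GPU is assumed.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy model standing in for a real workload.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).cuda()
inputs = torch.randn(64, 1024, device="cuda")

# Record both CPU-side operator dispatch and GPU kernel activity.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(inputs)

# Rank operators by total GPU time to surface the hottest kernels.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

In practice, tables like this one, combined with traces from tools such as NVIDIA Nsight, are what guide decisions about data loading, kernel efficiency, and communication overlap.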