GovTech is the lead agency driving Singapore’s Smart Nation initiatives and public sector digital transformation. As the Centre of Excellence for Infocomm Technology and Smart Systems (ICT & SS), GovTech develops the Singapore Government’s capabilities in Data Science & Artificial Intelligence, Application Development, Smart City Technology, Digital Infrastructure, and Cybersecurity.

Within the AI Practice, the AI Infrastructure & Operations (AI Infra & Ops) team defines the reference architecture and best practices for running AI at scale. Rather than owning a single product, we work across multiple agencies and use cases — helping teams stand up production-grade AI infrastructure, establishing governance standards, and building reusable platform components that accelerate adoption government-wide.

As a Senior AI Infra & Ops Engineer, you will be a technical leader and trusted advisor — shaping architectural standards, uplifting engineering capability across agencies, and ensuring that AI workloads across government are production-ready, cost-efficient, and secure. You will operate at the intersection of deep technical expertise and stakeholder engagement, working with a diverse set of agency teams at varying levels of AI maturity.

At GovTech, we offer a purposeful career in making lives better, and we empower our people to master their craft through robust learning and development opportunities all year round.

Play a part in Singapore’s vision to build a Smart Nation and embark on your meaningful journey to build tech for public good. Join us to advance our mission and shape your future with us today!

Learn more about GovTech at tech.gov.sg.

[What you will be working on]

You will bridge the gap between experimental data science and production-grade engineering, working across multiple agencies and projects simultaneously and combining hands-on engineering with advisory and enablement work. You will be responsible for:

MLOps & Lifecycle Management: Design, build, and maintain automated CI/CD pipelines for ML models. Establish whole-of-government standards for model versioning, automated testing, canary deployments, and rollback strategies. Build reusable pipeline components and templates that agencies can adopt to move toward self-service model delivery.
Model Optimization & Quantization: Apply and benchmark advanced techniques (e.g., quantization, pruning, distillation, speculative decoding) to optimize LLMs and other deep learning models for inference. Evaluate inference engines (vLLM, TensorRT-LLM, Triton Inference Server) and publish recommendations balancing performance, cost, and resilience. Provide hands-on support to agency teams applying these to their workloads.
Infrastructure & Platform Architecture: Architect scalable, multi-tenant AI infrastructure (on-prem and cloud) to support training and inference workloads, with a focus on GPU orchestration (e.g., Kubernetes, Volcano, Run:ai, NVIDIA GPU Operator). Define reference architectures, capacity planning guidance, and FinOps practices that agencies can adopt.
Observability & Monitoring: Design and champion observability standards for AI systems — covering model health, data/concept drift, system latency, GPU telemetry, and cost attribution. Define SLO/SLI frameworks to ensure AI services are transparent, debuggable, and maintainable at scale.
Security & Governance: Collaborate with Responsible AI and cybersecurity stakeholders to embed security guardrails and compliance checks into standardized deployment pipelines, ensuring AI systems meet government security standards (e.g., IM8, CSA guidelines). Drive governance frameworks covering model lineage, reproducibility, and approval workflows.
Technical Consulting & Enablement: Act as a senior subject matter expert for government agencies, conducting architecture reviews, maturity assessments, and providing actionable roadmaps to scale AI from POC to production. Develop reusable playbooks, reference implementations, and training materials. Mentor engineers within the AI Practice and across partner agencies — the goal is to build self-sufficient teams.

[What we are looking for]

Education: Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related technical field. Postgraduate study or research in distributed systems, ML systems, or high-performance computing is a plus.
Experience: At least 6 years (typically 6–8) of progressive experience in DevOps, MLOps, Cloud Infrastructure, or Platform Engineering, with a significant portion focused on AI/ML workloads in production environments. Experience in a consultancy, platform team, or CoE setting, where you have had to advise and enable multiple teams rather than build for just one, is highly valued.
Core Technical Skills:
- Strong proficiency in Python and shell scripting; working knowledge of at least one systems language (Go, Rust, etc.) is a plus.
- Deep, hands-on expertise with containerization and orchestration (Docker, Kubernetes) in production settings, including experience operating multi-tenant clusters.
- Proven experience designing MLOps platforms and patterns using frameworks such as Kubeflow, MLflow, BentoML, or equivalent tooling.
- Advanced Infrastructure-as-Code (IaC) skills with Terraform or Ansible — including module design, state management, and CI/CD integration for infrastructure.
- Experience with GitOps workflows (e.g., Argo CD) and platform engineering patterns (internal developer platforms, self-service portals, golden paths).
Domain Expertise:
- Strong understanding of GPU hardware and performance tuning for distributed training and inference.
- Hands-on experience with LLM serving stacks (vLLM, TensorRT-LLM, Triton Inference Server, KServe) and model optimization techniques (quantization, distillation, pruning).
- Expertise in observability stacks (Prometheus, Grafana, OpenTelemetry, ELK/Loki) and ML-specific monitoring tools (Arize, Langfuse, etc.).
- Solid understanding of networking fundamentals and service mesh architectures.
Consulting & Leadership Skills:
- Demonstrated ability to engage with multiple stakeholder teams at varying levels of technical maturity and tailor advice accordingly. You can be equally effective whiteboarding with senior engineers and presenting infrastructure roadmaps to agency leadership.
- Experience conducting architecture reviews, writing technical guidance documents, and developing reusable standards that others adopt.
- Track record of mentoring engineers and building capability in others — the measure of success is teams that become self-sufficient, not teams that stay dependent on you.
- Strong written and verbal communication skills. You can produce clear, concise technical documentation and present confidently to both technical and non-technical audiences.
- A proactive, systems-thinking mindset — you anticipate failure modes across diverse environments and can triage the highest-impact recommendations when agency teams have limited bandwidth.
- Passion for building public good through scalable, well-governed technology.
Preferred / Bonus:
- Experience managing large-scale GPU clusters or cloud AI services in production.
- Prior experience in a consulting, professional services, or internal platform/CoE role serving multiple product teams.
- Contributions to open-source projects in the MLOps, Kubernetes, or AI infrastructure space.
- Certifications in Cloud Architecture or Kubernetes.
- Experience with cost optimization at scale — FinOps practices, spot/preemptible instance strategies, and resource right-sizing for GPU workloads.
- Familiarity with emerging paradigms: multi-modal model serving, agentic AI infrastructure, retrieval-augmented generation (RAG) pipelines, and edge inference.

Learn more about life inside GovTech at go.gov.sg/GovTechCareers.
Stay connected with us on social media at go.gov.sg/ConnectWithGovTech.

Senior Software Engineer, AI Infrastructure & Operations (AI Practice)
