Operate and optimise production AI platforms for enterprise customers.
You will be responsible for the day-to-day operational health of production AI systems across a portfolio of enterprise customers. This includes monitoring, incident response, performance optimisation, and continuous improvement of platform reliability.
What you'll do
- Monitor and maintain production AI platform health across customer environments
- Own incident response and resolution for managed services customers
- Drive performance tuning and reliability improvements
- Develop runbooks, operational playbooks, and automation tooling
- Work with engineering teams to hand over platforms from delivery to managed operations
What we're looking for
- Background in platform, site reliability, or managed services engineering
- Experience operating cloud-native infrastructure (AWS, GCP, or Azure)
- Strong incident management and troubleshooting skills
- Exposure to AI/ML platform operations is a strong advantage