Our client reserves the right not to make an appointment. In considering candidates for appointment to advertised posts, preference will be given to persons from a designated group in accordance with the approved Employment Equity Plan.

AI Operations Specialist (YD/AIOS/29/9/25)

Overview

Reference
YD/AIOS/29/9/25

Salary
Not specified

Job Location
Johannesburg, Johannesburg Metro, South Africa

Job Type
Permanent

Posted
01 October 2025

Closing date
31 October 2025, 21:59


The AI Operations Specialist is responsible for ensuring the stability, reliability, and performance of AI systems in production. This role blends Machine Learning Operations (MLOps), Bot Operations, and Quality Assurance (QA) practices, supporting both backend models and customer-facing bots.

The specialist will manage deployments, monitor system health, conduct testing, and validate quality to guarantee seamless AI operations and a consistent customer experience.


Key Responsibilities

  • System Monitoring & Reliability

    • Monitor AI systems to ensure uptime, performance, and error-free operation.

    • Implement automated monitoring and alerting systems for AI performance.

    • Define service level objectives (SLOs) and error budgets for AI features.

  • Testing & Quality Assurance

    • Design, develop, and maintain automated test scripts for web, mobile, and API testing.

    • Execute manual test cases for exploratory and non-automatable scenarios.

    • Identify, log, and track bugs to closure using Jira, Azure DevOps, or equivalent tools.

    • Ensure adherence to QA methodologies, tools, and best practices.

  • AI & Bot Operations

    • Manage version control, deployment, and release cycles of AI models and bots.

    • Conduct offline/online evaluations and A/B tests for prompts, models, and policies.

    • Track and troubleshoot production issues (latency, escalations, fallback errors).

    • Design and orchestrate conversations using Voiceflow and related tools.

  • Observability, Security & Compliance

    • Work with observability/evaluation tools (Langfuse, Arize/Phoenix, W&B, Prometheus, Grafana).

    • Implement guardrails, safety red-teaming, and prompt-injection defenses.

    • Manage PII handling, content safety filters, and data loss prevention (DLP).

    • Document incidents, produce root cause analysis (RCA) reports, and ensure compliance with data privacy and security policies.

  • Collaboration & Continuous Improvement

    • Collaborate with developers, data scientists, and business teams to resolve operational issues.

    • Participate in incident on-call rotations, maintain runbooks, and conduct disaster recovery (DR) tests.

    • Recommend improvements to enhance system resilience, reduce downtime, and optimize costs.


Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).

  • 3+ years of experience in operations, DevOps, or AI/ML support roles.

  • Proven track record of managing large, complex, multi-stakeholder projects.

  • Strong knowledge of Machine Learning practices (model monitoring, retraining, pipelines).

  • Familiarity with conversational AI platforms (Amazon Lex, Salesforce Einstein Bots, ElevenLabs).

  • Integration experience with Amazon Connect, Genesys Cloud CX, NICE CXone, or similar.

  • Proficiency with test automation tools (Selenium, Playwright, Cypress, Appium, or equivalent).

  • Experience with API testing tools (Postman, RestAssured, Karate).

  • Strong scripting and automation skills (Python, Bash, CI/CD pipelines).

  • English proficiency at CEFR level C1.


Nice to Have

  • Experience with performance/load testing tools (JMeter, Locust, Gatling).

  • Knowledge of cloud platforms (AWS, Azure, GCP).

  • Familiarity with Git or other version control systems.

  • QA certification (ISTQB or equivalent).


Contact information

Yandiswa D