Case Study · 01

Cutting call-review time by ~93% with an AI QA pipeline

Designed a five-stage automated pipeline replacing a manual, high-effort per-call QA process across a contact-centre operation.

Industry: Telecom / Comparison Platform
My Role: Technical Product Manager (sole PM)
Outcome: ~93% reduction in per-call review time
Deliverable: PRD published to Confluence, pipeline specification

The Problem

The contact centre team manually reviewed recorded calls to assess agent quality, a labour-intensive process taking roughly 25–30 minutes per call. With hundreds of calls per week, this created a significant bottleneck: only about 15% of calls were reviewed, feedback reached agents 2–3 days late, and scores depended heavily on individual reviewer judgment, with nothing in place to keep scoring consistent between reviewers.

The business needed a scalable QA system that could handle full coverage, deliver same-day feedback, and produce consistent, auditable scores — without a proportional increase in headcount.

Constraint: The solution had to fit existing infrastructure (no new cloud contracts) and be explainable to non-technical stakeholders including compliance and operations leads.

My Approach

I started with a week of discovery — sitting with QA reviewers, mapping their scoring rubric, and identifying where judgment calls were most inconsistent. The core insight was that roughly 70% of the scoring criteria were rule-based (was a disclosure made? was the correct product name used?), and the remaining 30% required contextual judgment (tone, empathy, handling objections).

This split informed the pipeline architecture: use a transcription layer for speed, an LLM for contextual evaluation, and a rule engine for deterministic criteria — keeping humans in the loop only for edge cases and final audit.
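A minimal sketch of that split, assuming a transcript is already available as plain text. The criterion names, the trigger phrases, the product name "Fibre 100", and the `llm` callable are all hypothetical placeholders for illustration, not the rubric or model interface used in the project.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CriterionResult:
    criterion: str
    passed: bool
    reasoning: str   # kept per criterion so every score can be explained
    source: str      # "rule" or "llm"


# Hypothetical deterministic criteria: each is a simple check on the transcript.
RULES: dict[str, Callable[[str], bool]] = {
    "disclosure_made": lambda t: "this call is recorded" in t.lower(),
    "correct_product_name": lambda t: "fibre 100" in t.lower(),
}

# Hypothetical contextual criteria that need judgment rather than a rule.
CONTEXTUAL = ["tone", "empathy", "objection_handling"]


def score_rules(transcript: str) -> list[CriterionResult]:
    """Score the rule-based criteria with deterministic checks."""
    results = []
    for name, check in RULES.items():
        ok = check(transcript)
        note = "required phrase found" if ok else "required phrase not found"
        results.append(CriterionResult(name, ok, note, "rule"))
    return results


def score_contextual(transcript: str, llm) -> list[CriterionResult]:
    """Score the contextual criteria via an LLM callable.

    `llm(criterion, transcript)` is an assumed interface that returns
    a (passed, reasoning) tuple; swap in a real client as needed.
    """
    results = []
    for name in CONTEXTUAL:
        passed, reasoning = llm(name, transcript)
        results.append(CriterionResult(name, passed, reasoning, "llm"))
    return results


def evaluate_call(transcript: str, llm) -> list[CriterionResult]:
    """Combine both scoring paths into one list of per-criterion results."""
    return score_rules(transcript) + score_contextual(transcript, llm)


def stub_llm(criterion: str, transcript: str):
    """Stand-in for a real model call so the sketch runs offline."""
    return True, f"stub judgment for {criterion}"
```

Keeping the rule path separate from the LLM path means the deterministic criteria stay cheap and fully reproducible, while only the judgment-based criteria depend on model behaviour.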

Pipeline Architecture

~93% reduction in per-call review time
100% call coverage (vs ~15% manual)
Same-day agent feedback (vs 2–3 day lag)
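As a rough sanity check on the headline figure, the arithmetic below works out what a ~93% reduction implies for the 25–30 minute manual baseline. The resulting range is derived from those two numbers only; it is not a separately measured value.

```python
# Figures taken from the case study: ~93% reduction, 25-30 min manual baseline.
REDUCTION = 0.93
MANUAL_MINUTES = (25, 30)


def remaining_minutes(manual: float, reduction: float = REDUCTION) -> float:
    """Human review time left per call after applying the reduction."""
    return manual * (1 - reduction)


low, high = (remaining_minutes(m) for m in MANUAL_MINUTES)
print(f"Implied review time per call: {low:.2f} to {high:.2f} minutes")
```

In other words, the reduction corresponds to roughly two minutes of human time per call, down from about half an hour.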

What I Produced

A full PRD published to Confluence covering: problem statement, success metrics, pipeline specification, prompt design rationale, failure modes and mitigations, data privacy considerations, and a phased rollout plan. I also produced a one-page executive summary for the operations and compliance leads.

Challenges & Trade-offs

The main tension was between automation speed and auditability. Compliance stakeholders were cautious about an LLM making scoring decisions on regulated calls. I addressed this by making the reasoning traces visible, building in a mandatory human-review queue for any call touching a compliance criterion, and framing the LLM as an "analyst assistant" rather than a decision-maker. This framing unlocked sign-off.
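The review-queue rule can be sketched as a simple routing step. This assumes "touching a compliance criterion" means the call was scored on at least one criterion flagged as compliance-related; the criterion names and data shapes are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical set of criteria that compliance leads want a human to confirm.
COMPLIANCE_CRITERIA = {"disclosure_made", "consent_confirmed"}


@dataclass
class Score:
    criterion: str
    passed: bool
    reasoning: str  # the visible reasoning trace for this criterion


@dataclass
class CallReview:
    call_id: str
    scores: list[Score]
    needs_human_review: bool = False
    review_reasons: list[str] = field(default_factory=list)


def route(call_id: str, scores: list[Score]) -> CallReview:
    """Flag a call for mandatory human review if any compliance criterion was scored.

    The flag is set regardless of pass or fail, so the automated score
    is treated as an analyst's recommendation, never the final decision.
    """
    review = CallReview(call_id, scores)
    for s in scores:
        if s.criterion in COMPLIANCE_CRITERIA:
            review.needs_human_review = True
            review.review_reasons.append(f"{s.criterion}: {s.reasoning}")
    return review
```

Carrying the reasoning into `review_reasons` gives the human reviewer the trace alongside the score, which is the auditability property the compliance leads asked for.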

A secondary challenge was prompt stability: the LLM's scores needed to be repeatable and to agree with those of human evaluators. I invested significant time in prompt design and testing, including a small human-vs-model calibration exercise to validate scoring alignment.
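One way such a calibration exercise could be quantified is sketched below, using raw agreement and Cohen's kappa between human and model labels on the same calls. The choice of metrics is an assumption for illustration; the case study does not state which measure was used, and the labels shown are toy values, not project data.

```python
from collections import Counter


def agreement_rate(human: list[str], model: list[str]) -> float:
    """Fraction of items where the human and model labels match."""
    if not human or len(human) != len(model):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def cohens_kappa(human: list[str], model: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    p_observed = agreement_rate(human, model)
    h_counts, m_counts = Counter(human), Counter(model)
    labels = h_counts.keys() | m_counts.keys()
    p_expected = sum(h_counts[l] * m_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels for one criterion across four calls (illustrative only).
human_labels = ["pass", "pass", "fail", "fail"]
model_labels = ["pass", "pass", "fail", "pass"]
print(agreement_rate(human_labels, model_labels))
print(cohens_kappa(human_labels, model_labels))
```

Kappa is useful alongside raw agreement because a criterion that almost always passes can show high agreement by chance alone.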

What I'd Do Differently

I'd involve the agent team earlier. QA scoring felt opaque to agents, and the new system's transparency (showing reasoning per criterion) was actually a bigger benefit than anticipated — one I didn't initially include in the business case. Earlier agent interviews would have surfaced this and strengthened the proposal.
