CI/CD & Eval Harness for Agents
Operationalized agent evaluation: gating task success, grounding, latency, and drift on every deploy with LangSmith.
Context
Agent behavior is non-deterministic: a prompt tweak or model bump can silently regress grounding or blow up latency. Manual spot-checks didn't scale and didn't catch drift.
We needed reproducible, automated evaluation wired into the same CI/CD that ships the code.
Approach
I built a LangSmith-backed eval harness with curated datasets and scorers for task success, retrieval grounding, output quality, and latency.
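To make the scorer idea concrete, here is a minimal sketch in plain Python of what three of those scorers might look like. It is not the harness's actual code and does not call the LangSmith SDK: the `AgentRun` record, the field names, the substring match for task success, the citation-overlap definition of grounding, and the 5-second latency budget are all assumptions chosen for illustration. Output quality is left out because it would typically need a rubric or a model-based judge.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Hypothetical record of one agent run against one dataset example."""
    answer: str
    retrieved_docs: list[str]  # IDs of documents the retriever returned
    cited_docs: list[str]      # IDs of documents the answer cites
    latency_s: float           # wall-clock time for the run, in seconds


def task_success(run: AgentRun, expected: str) -> float:
    """1.0 if the expected string appears in the answer (case-insensitive)."""
    return 1.0 if expected.strip().lower() in run.answer.lower() else 0.0


def grounding(run: AgentRun) -> float:
    """Fraction of cited documents that were actually retrieved.

    An answer with no citations scores 0.0 (an assumption; a real suite
    might treat that case separately).
    """
    if not run.cited_docs:
        return 0.0
    retrieved = set(run.retrieved_docs)
    hits = sum(1 for doc in run.cited_docs if doc in retrieved)
    return hits / len(run.cited_docs)


def latency_ok(run: AgentRun, budget_s: float = 5.0) -> float:
    """1.0 if the run finished within the (assumed) latency budget."""
    return 1.0 if run.latency_s <= budget_s else 0.0


def score_run(run: AgentRun, expected: str) -> dict[str, float]:
    """Collect all scorer outputs for one run into a single dict."""
    return {
        "task_success": task_success(run, expected),
        "grounding": grounding(run),
        "latency_ok": latency_ok(run),
    }
```

Each scorer returns a float in [0, 1], so per-example scores can be averaged over a dataset into one number per metric, the kind of value a scorecard would report.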
GitHub Actions runs the suite on every PR and deploy, publishes a scorecard, and blocks merges that regress beyond a set threshold. This turns responsible-AI guardrails into a continuous engineering process.
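The merge gate can be sketched as a comparison of the candidate's aggregate scores against a stored baseline. The version below is an illustration under stated assumptions, not the pipeline's real logic: the metric names, the tolerance values, and the use of p95 latency are all hypothetical.

```python
# Hypothetical tolerances: the largest drop from baseline that still passes.
MAX_REGRESSION = {"task_success": 0.02, "grounding": 0.02, "quality": 0.05}
# Latency is "lower is better", so it gets its own (assumed) ceiling on growth.
MAX_LATENCY_INCREASE_S = 0.5


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Return one human-readable failure per metric that regressed too far."""
    failures = []
    for metric, tolerance in MAX_REGRESSION.items():
        drop = baseline[metric] - candidate[metric]
        if drop > tolerance:
            failures.append(
                f"{metric}: {baseline[metric]:.2f} -> {candidate[metric]:.2f} "
                f"(drop {drop:.2f} exceeds {tolerance:.2f})"
            )
    increase = candidate["latency_p95_s"] - baseline["latency_p95_s"]
    if increase > MAX_LATENCY_INCREASE_S:
        failures.append(
            f"latency_p95_s: +{increase:.2f}s exceeds +{MAX_LATENCY_INCREASE_S:.2f}s"
        )
    return failures


def exit_code(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Print any failures and return 1 (block) or 0 (pass)."""
    failures = gate(baseline, candidate)
    for failure in failures:
        print(f"REGRESSION {failure}")
    return 1 if failures else 0
```

A CI step would pass the result to `sys.exit`. A GitHub Actions step fails on a non-zero exit code, and marking that job as a required status check is what turns the failure into a blocked merge.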
  PR / deploy
       │
       ▼
┌──────────────┐    ┌──────────────────────┐
│ GitHub Action│───▶│ LangSmith eval suite │
└──────────────┘    │ success · grounding  │
       │            │ quality · latency    │
       │            └──────────┬───────────┘
       ▼                       ▼
   scorecard ◀────────── pass / block gate
Outcome
Every change now ships with an evidence trail. Regressions are caught pre-merge, drift is tracked release-over-release, and non-AI engineers can extend the suite themselves.