1 min read
Eval Pipelines in CI
If prompts, tools, or retrieval logic can change behavior, then evaluation needs to be part of the delivery pipeline.
Why Run Evals in CI
Because LLM regressions often look like:
- lower answer quality
- worse tool selection
- broken formatting
- new safety failures
Traditional unit tests do not catch these well.
What to Evaluate
- task success
- schema correctness
- retrieval relevance
- tool-call correctness
- refusal / safety behavior
- cost and latency budgets
Practical CI Pattern
- run deterministic checks first
- run a focused eval set on changed workflows
- compare scores to baseline
- fail or warn if thresholds regress
Keep CI evals small and high-signal. Larger exploratory evals can run on schedule.
Good Eval Set Design
- representative of real tasks
- versioned like code
- includes adversarial and edge cases
- has clear pass/fail semantics where possible
Interview Answer
Why put evals in CI?
Because LLM systems can regress without traditional code-level failures. CI evals provide an automated quality gate for output quality, tool use, safety, latency, and cost so changes are caught before they reach production.
[prev·next]