Evaluation and Benchmarks for Agents
Delivering enterprise-grade AI agents at unprecedented speed
About the Client
A global enterprise operating across healthcare, finance, telecom, and education engaged Tbrain to stand up domain-specific Q&A agents and a practical evaluation framework they could operate in-house. The assignment prioritized realism, safety, and speed to value.
Objective
Deliver six production-grade agents grounded in authentic, approved knowledge, plus a turnkey evaluation package the client could run immediately - completed in one month from kickoff to handoff.
Each agent would answer only when evidence exists in its corpus and refuse clearly when evidence is absent, with every expected answer traceable to source material.
The Challenge
Timeline & Scale
Evaluation Requirements
Each corpus mixed formats such as PDF, DOCX, PPTX, XLSX/CSV, HTML, and SharePoint pages, with layout variety such as nested headings, footnotes, long tables, charts, and images. File sizes ranged from small to large.
The query set had to feel human - covering fact-seeking, procedural, comparison, multi-part, and hypothetical queries, with realistic patterns such as misspellings and domain-term paraphrases. Many prompts required combining evidence across 2, 5, or even 10+ documents.
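To make that requirement concrete, a minimal sketch of how such a benchmark query could be recorded is shown below. The field names and values are illustrative assumptions, not the client's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkQuery:
    """Illustrative benchmark query record (hypothetical schema)."""
    query_id: str
    text: str                      # prompt as a user would type it, typos included
    query_type: str                # "fact", "procedural", "comparison", "multi_part", "hypothetical"
    source_docs: list[str] = field(default_factory=list)  # documents whose evidence must be combined
    answerable: bool = True        # False -> the agent should refuse

# Example of a realistic, slightly misspelled prompt that needs evidence from two documents.
example = BenchmarkQuery(
    query_id="fin-042",
    text="whats the max reimbursment for out-of-network claims?",
    query_type="fact",
    source_docs=["claims_policy_2023.pdf", "benefits_faq.docx"],
)
```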
Tbrain's Strategic Approach
Tbrain executed a pod-based operating model to maximize throughput within the 1-month window. Multiple teams worked in parallel, each owning the end-to-end lifecycle for one agent under central coordination.
Pod-Based Operating Model
Five-Stage Workflow
Corpus Curation
Sourcing authentic documents, normalizing formats, removing duplicates - each agent's corpus holds ≥45 files
Query Generation
Producing ~120 realistic prompts per agent covering single and cross-document reasoning
Ground-Truth Mapping
Attaching span-level evidence to every answerable query and marking unanswerable prompts for safe refusal (see the record sketch after this workflow)
Quality Review
Enforcing rubric alignment, inter-rater checks, and policy verification for audit-ready outputs
Final Packaging
Assembling test-ready bundles approved by Team Leads for immediate handoff
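The ground-truth mapping stage is the most detailed of the five. A minimal sketch of a span-level evidence record is shown below, under the assumption that evidence is addressed by document, page, section, and character offsets; all identifiers and values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvidenceSpan:
    """Pointer to the exact supporting text (hypothetical addressing scheme)."""
    doc_id: str          # e.g. "claims_policy_2023.pdf"
    page: int
    section: str         # heading or table the span falls under
    char_start: int      # character offsets of the supporting passage
    char_end: int

@dataclass
class GroundTruth:
    query_id: str
    expected_answer: Optional[str]           # None when the correct behavior is a refusal
    evidence: list[EvidenceSpan] = field(default_factory=list)
    refusal_required: bool = False

# Answerable query: the expected answer is traceable to an exact passage.
gt_answerable = GroundTruth(
    query_id="fin-042",
    expected_answer="Out-of-network claims are reimbursed up to the policy's allowed amount.",  # illustrative text
    evidence=[EvidenceSpan("claims_policy_2023.pdf", page=12,
                           section="4.2 Out-of-Network Coverage",
                           char_start=1840, char_end=1925)],
)

# Unanswerable query: no supporting evidence exists, so the agent must refuse.
gt_refusal = GroundTruth(query_id="fin-117", expected_answer=None, refusal_required=True)
```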
Evaluation Rubric & Metrics
Every response is compared to the approved corpus and assigned one of three outcomes: Correct, Needs Correction, or Refusal Required. A minimal scoring sketch follows the rubric below.
✓ Correctness
Factually accurate using approved corpus only - verified against exact source passages
✓ Instruction Following
Respects scope, format, jurisdiction, dates, units, and length constraints
✓ Evidence & Citation Quality
Points to exact span, section, page, or table cell for quick verification
✓ Safety & Compliance
Politely refuses out-of-scope queries - no speculation or external sources
✓ Clarity & Formatting
Concise, readable, delivered in requested structure with proper units and labels
✓ Coverage
Measures answerable query share - signals corpus adequacy, not model score
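As a rough illustration of how the three-way outcome and the coverage signal can be tallied, the sketch below assumes hand-graded outcome labels per response; it is not the client's actual tooling.

```python
from collections import Counter

OUTCOMES = ("Correct", "Needs Correction", "Refusal Required")

def coverage(ground_truth: list[dict]) -> float:
    """Share of queries answerable from the corpus.

    Signals corpus adequacy rather than model quality: a low value means
    the corpus lacks evidence for many realistic questions.
    """
    if not ground_truth:
        return 0.0
    answerable = sum(1 for gt in ground_truth if not gt["refusal_required"])
    return answerable / len(ground_truth)

def outcome_summary(graded: list[str]) -> dict[str, float]:
    """Proportion of graded responses falling into each rubric outcome."""
    counts = Counter(graded)
    total = len(graded) or 1
    return {label: counts[label] / total for label in OUTCOMES}

# Illustrative run over four hand-graded responses.
print(outcome_summary(["Correct", "Correct", "Needs Correction", "Refusal Required"]))
# {'Correct': 0.5, 'Needs Correction': 0.25, 'Refusal Required': 0.25}
```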
Outcome & Impact
Project Deliverables
Client Benefits
- Turnkey evaluation framework ready to run internally for benchmarking and fine-tuning
- Every answer mapped to precise supporting passages for streamlined review and audits
- Reproducible & scalable - includes templates and checklists to extend the program at the same pace
- Reduced time-to-value while raising confidence in both grounded accuracy and refusal behavior
Ready to Build Enterprise-Grade AI Agents?
Let Tbrain help you deliver production-ready agents on enterprise timelines
Contact Us Today