Project overview

About the client

A global enterprise operating across healthcare, finance, telecom, and education engaged Tbrain to stand up domain‑specific Q&A agents and a practical evaluation framework they could operate in‑house. The assignment prioritized realism, safety, and speed to value.

Objective

Deliver 6 production‑grade agents grounded in authentic, approved knowledge and a turnkey evaluation package that the client could run immediately achieved in 1 month from kickoff to handoff. Each agent would answer only when evidence exists in its corpus and refuse clearly when evidence is absent, with every expected answer traceable to source material.

Challenge

The program had to land in 4 weeks while meeting enterprise-grade constraints that would typically span an entire quarter. Within that timeline, the team needed to deliver 6 credible agents that behave safely in regulated settings, each grounded in ≥ 45 approved files, with every answer traceable to an exact source passage and no reliance on external web search when the corpus does not contain evidence. The evaluation set per agent was fixed at 120 human-written prompts 100 answerable strictly from the corpus and 20 intentionally unanswerable to validate graceful, policy-compliant refusal so realism and safety had to be proven side by side rather than staged.

Realism was specified down to the structure and messiness of the source material. Each corpus mixed formats such as PDF, DOCX, PPTX, XLSX/CSV, HTML, and SharePoint pages or libraries, and included layout variety like nested headings, footnotes, long tables, charts, and images embedded in documents. Files spanned small, medium, and large sizes, and spreadsheets featured hundreds of rows across multiple worksheets, with practical “noise” like merged headers, inconsistent units, and formatting irregularities. Several use cases also required country-aware retrieval in which folder names and metadata constrained answers to the correct jurisdiction, without introducing fragility at runtime.

The query set had to feel human in more than tone. It covered a wide range of intent types – from fact seeking and procedural to comparison, multi-part, hypothetical, and short follow-ups – varied clarity levels, different lengths, and realistic patterns like keyword shortcuts, misspellings, domain-term paraphrases, and web-style phrasing. Coverage requirements ensured that many prompts were only answerable by combining evidence across 2, 5, and 10+ documents rather than single-file lookup. For every answerable prompt, reviewers prepared span-level ground truth; for every unanswerable prompt, they wrote a clear justification explaining why it must be refused even if topically related. All corpora, queries, expected answers, and justifications were human-authored and human-verified before handoff.

Acceptance criteria raised the operational bar. Agents had to run cleanly on the client’s agent platform without authentication or runtime errors; responses had to stay on-domain, include valid citations when evidence exists, and withstand subject-matter review for usefulness, correctness, clarity, formatting, and citation validity. Delivery needed to be fully auditable through a shared tracker listing agent metadata, corpus composition, query attributes, answer locations, ground-truth snippets, and links back to sources.

Tbrain’s Strategic Solution

Tbrain executed a pod‑based operating model to maximize throughput within the 1‑month window. Multiple teams in parallel, each owning the end‑to‑end lifecycle for one agent under the coordination of a central Project Manager. Inside each team, a Domain Expert curated domain‑faithful materials, a Query Writer authored natural‑language prompts that mirror professional intent and tone, and a Reviewer‑QA validated corpus integrity, answer traceability, and refusal behavior. To accelerate without sacrificing consistency, we operated a KC (knowledge corpus) checklist system and a query checklist that gated each stage, paired with standardized templates for corpus intake, query specifications, and ground‑truth annotation, these controls kept artifacts synchronized across pods and reduced rework. Runtime behavior followed an auditable flow from input through planning, retrieval and tool use, execution, and verification, with a human‑in‑the‑loop whenever confidence thresholds or policy checks required intervention.

Five‑stage workflow

Production moved through five integrated stages as a continuous line that repeated for each agent. Corpus curation established the foundation by sourcing authentic, real‑world documents, normalizing formats, removing duplicates, and screening for privacy and policy constraints until each agent held at least 45 files that captured facts, procedures, and the tone of internal communications. With the corpus stabilized, query generation produced roughly 120 realistic prompts per agent that covered single‑document questions and cross‑document reasoning, including conditional and exception‑driven scenarios. Ground‑truth mapping then attached span‑level evidence to every answerable query and marked intentionally unanswerable prompts that would exercise safe refusal. Quality review enforced rubric alignment, inter‑rater checks, targeted sampling, and policy verification so outputs were consistent and audit‑ready. Final packaging assembled agents and artifacts into a test‑ready bundle that Team Leads approved and handed off without further clarification. Across the program we delivered 6 agents supported by 270 curated documents, 720 human‑written queries, and 160 unanswerable safety prompts – all completed in 4 weeks.

Evaluation Rubric & Metrics

Evaluation follows clear and auditable criteria so any reviewer can make confident decisions without opaque scores. Every response is compared to the approved corpus. For each prompt the reviewer records one outcome – Correct, Needs Correction, or Refusal Required – and includes a link to the exact supporting passage when that applies. The rubric focuses on six dimensions: correctness, instruction following, evidence and citation quality, safety and compliance, clarity and formatting, and coverage.

Correctness: A response counts as correct only when it is factually accurate using the approved corpus and nothing else. Reviewers verify the claim against the exact passage in the source. If the evidence does not support the claim, or if supported facts are mixed with speculation, the item is marked Needs Correction. Common errors include misreading a table, citing an outdated policy, or combining unrelated facts.
Instruction following: The response must respect the constraints in the prompt, including scope, required format, jurisdiction, dates, units, and length. An answer can be accurate yet still fail if it ignores these constraints. This keeps outputs usable in operations.
Evidence and citation quality: When an answer is appropriate it must point to the exact span, section identifier, page, or table cell that supports it so reviewers can confirm provenance quickly. If the corpus does not contain evidence, the correct behavior is to state that the information is not available rather than to guess. Missing, vague, or incorrect citations lead to Needs Correction even if the narrative seems plausible.
Safety and compliance: When a query falls outside the scope or raises policy or PII risk, the agent should politely refuse and explain that the information cannot be provided. Speculation is not permitted, and external web sources are not to be used. This protects users in regulated contexts and prevents unsupported assertions from entering downstream workflows.
Clarity and formatting: The response should be concise, readable, and delivered in the structure the prompt requests, for example a short paragraph, a list, or a simple table. Formatting should keep key units, dates, and labels so the result is ready for business users.
Coverage: Coverage measures the share of tested queries that are answerable within the approved corpus. It is reported at the dataset and domain level so teams can tell whether gaps come from model behavior or missing source content. Coverage is a signal about corpus adequacy, not a model score.

These criteria keep the process straightforward and repeatable. Reviewers adjudicate each prompt against the corpus, attach the precise evidence when applicable, and record the outcome using the three labels above. The result is an evaluation record that is easy to audit, easy to compare across agents or iterations, and useful for go or no go decisions, fine-tuning, and ongoing quality monitoring.

Outcome

Within 1 month, the client received 6 test‑ready agents and a turnkey evaluation framework they can run internally to benchmark, fine‑tune, and validate performance on realistic, trustworthy data. Each agent ships with an approved corpus totaling 270 curated files across the program, a balanced evaluation set of 720 human‑written prompts, and a safety suite of 160 unanswerable items that prove out‑of‑scope detection and graceful refusal. Every expected answer is mapped to the precise passage that supports it, which streamlines review and supports audits.

The handover also includes the KC, the query checklist, and the living templates used during production so the client can extend the program at the same pace and with the same standards. Because the framework is reproducible and scalable, new domains can be brought online quickly, and the harness can be wired into continuous evaluation with simple reporting and drift alerts tied to policy and knowledge changes. For teams under pressure to deliver useful, safe agents on enterprise timelines, this program reduces time‑to‑value while raising confidence in both grounded accuracy and refusal behavior.

Categories:

Uncategorized