Learning point 1
How should I build evaluations for an AI agent in a banking context?
Evaluating an AI agent for banking requires a fundamentally different approach than evaluating traditional software. Software either works or does not - outputs are deterministic. AI agent outputs are probabilistic, contextual, and often involve judgment calls. Your evaluation framework must assess: accuracy on the task, consistency across similar inputs, calibration (does the agent know when it does not know?), compliance with guardrails, and escalation behavior under edge cases.
Start by building a golden dataset - a set of 200 to 500 real historical cases with known correct outcomes. For a credit underwriting agent, this means 200 loan applications where you already know whether the loan was approved, what the credit limit was, and what the key rationale was. Run the agent on all 200. Measure agreement rate, and critically, review the disagreements - are the cases where the agent differs actually cases where the agent was right and the human was wrong?
Beyond accuracy, evaluate consistency: run the same inputs 10 times and check whether the agent produces substantively identical outputs. LLMs can be non-deterministic - small variations in phrasing should not produce wildly different credit decisions. If they do, your prompt needs tightening or you need to reduce the model's temperature setting.
Compliance evaluation is the most critical dimension for a regulated institution. Test every hard guardrail explicitly: does the agent ever recommend an action that violates your policy? Does it ever skip a mandatory check? Build specific test cases designed to try to make the agent violate its guardrails - what prompt injections or edge case inputs might cause it to misbehave? For institutions operating under CBUAE or SAMA supervision, this includes testing alignment with the CBUAE Guidance Note's five principles - governance, fairness, transparency, human oversight, and data privacy - with evidence packaged for supervisory dialogue.
Learning point 2
What metrics should I track for an AI agent in production?
Production metrics for AI agents fall into four categories: task performance metrics, operational metrics, business outcome metrics, and safety metrics. A comprehensive monitoring dashboard covers all four. Tracking only task performance while ignoring safety metrics is a common and dangerous oversight in regulated industries.
| Category | Key metrics | Why it matters |
|---|---|---|
| Task performance | Completion rate, accuracy rate, hallucination rate, escalation rate | Core quality signal; escalation rate out of expected bounds signals prompt drift |
| Operational | Processing time per case, API success rates, token consumption, system availability | 2% error rate on 1,000 applications/day = 20 wrong recommendations daily |
| Business outcomes | TAT reduction vs baseline, cost per loan, 6M/12M portfolio quality, CSAT scores | The metrics that justify investment and identify where the agent needs improvement |
| Safety | Guardrail violation rate, bias metrics by segment, data access anomalies | Non-negotiable; any non-zero guardrail violation requires immediate investigation For GCC lenders, bias monitoring should test for fair treatment across nationality, expatriate status, gender, and emirate/region - segments where historical lending patterns may not reflect future risk. For Islamic banks, additionally test that Shariah-compliant product recommendations are consistent. |
Learning point 3
How do I know if my AI agent is hallucinating, and how do I prevent it?
Hallucination in an AI agent occurs when the model generates confident-sounding statements that are factually incorrect, cite sources that do not exist, or apply policy rules that are not in your actual policy documents. In a lending context, a hallucinating agent might invent a CBUAE circular or SAMA directive that does not exist, apply an eligibility criterion not in your credit policy, or state a borrower's income figure not present in the documents it was given.
Detecting hallucination requires ground-truth verification at multiple levels. For factual claims (interest rates, regulatory thresholds, policy eligibility criteria), every agent output should be traceable to a source document. If the agent cites CBUAE Notice 2024-XX or SAMA Circular, that cited source should exist and should say what the agent claims. Build automated citation checking into your evaluation pipeline.
Prevention is more important than detection. The three most effective hallucination controls for banking agents are: RAG (grounding every response in your actual policy and regulatory documents), low temperature settings on the LLM (reducing creative, speculative outputs), and explicit guardrails in the prompt that instruct the agent to say 'I do not have sufficient information to answer this' rather than speculating.
Learning point 4
What is human-in-the-loop (HITL) and how should it be designed for a lending institution?
Human-in-the-loop (HITL) is the design principle that AI agents should hand off to a human reviewer at defined decision points - particularly where the decision carries high risk, falls outside the agent's training distribution, or requires regulatory accountability. HITL is not a failure of AI; it is a deliberate design choice that makes AI deployments both safer and more regulatorily defensible.
For a lending institution, HITL should be triggered on at least four conditions: when the loan amount or risk exposure exceeds a defined threshold (for example, any loan above AED 2 million (or your institution's defined threshold) goes to a human credit manager for final sign-off), when the agent's confidence score falls below a defined floor, when the case exhibits characteristics the agent has never encountered (out-of-distribution inputs), and when any of the hard guardrail conditions are triggered.
The quality of HITL design determines whether it is a productivity booster or a bottleneck. A well-designed HITL interface gives the human reviewer a complete, structured briefing from the agent - all data analyzed, key findings, policy checks completed, and a clear statement of why the case was escalated - so the reviewer needs five minutes, not fifty, to make a decision.
