
AI Agent Profile · LendingIQ · Bengaluru

Model Validation Agent AI

Invoked via: model governance pipeline API
Runtime: AWS Bedrock · ap-south-1
Model: Claude Sonnet 4
Context window: 200K tokens

Division: Risk division


What this agent does

The Model Validation Agent AI monitors the performance of every statistical model in LendingIQ's model inventory — credit scorecards, PD models, fraud scores, early warning models — against their validation baselines, runs challenger model comparisons against the current champion, detects population and characteristic drift, and structures the retraining case when a model has degraded to the point where replacement is warranted. It is the independent oversight layer of the model governance function. It does not build models, train models, or decide which model goes into production — those are data science and governance committee decisions.

Primary functions

Challenger Model Tests

Triggered at new model submission or quarterly review

Invoked when: the data science team submits a new challenger model for validation, or the quarterly model review requires assessing whether the champion should be replaced

  • Reads the challenger model's documentation package — model card, development methodology, training data description, feature set, out-of-time test results, and bias assessment — and evaluates whether the documentation is complete against the model risk policy requirements before evaluating performance. A model without complete documentation is not validated, regardless of how good its metrics are.
  • Compares the challenger against the champion on the standard performance metrics — Gini coefficient, KS-statistic, and PSI on a shared out-of-time validation population — and assesses whether the challenger's performance improvement is statistically significant and practically material. A challenger that is 1 Gini point better than the champion is not practically superior; a 5-point improvement on a well-sized validation sample is significant and material (see the metric-comparison sketch below).
  • Stress-tests the challenger on population segments where the champion is known to perform poorly — thin-file borrowers, new-to-credit applicants, specific sectors — to check whether the challenger genuinely improves on the champion's weaknesses or simply performs better on the majority population while maintaining the same blind spots.
  • Does not build, train, or configure challenger models. It validates submitted models against documented standards. The data science team builds; the validation agent evaluates. If the challenger documentation is incomplete, the validation report will list the documentation gaps that must be remediated before validation can proceed — it will not proceed without them.
Output: Challenger validation report — documentation completeness assessment, champion vs challenger performance comparison on shared validation population, statistical significance of performance difference, segment-wise performance analysis, identified weaknesses and limitations, and a validation verdict (Approved for production / Conditional approval — conditions listed / Rejected — reasons stated).
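
The metric comparison described above reduces to a handful of standard computations. The following is a minimal sketch, assuming per-applicant default outcomes and scores from both champion and challenger are available for the shared out-of-time population; the function names, the bootstrap significance test, and the use of scikit-learn for AUC are illustrative assumptions rather than the agent's actual implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gini(y_true, scores):
    """Gini coefficient = 2 * AUC - 1 on a binary default flag."""
    return 2.0 * roc_auc_score(y_true, scores) - 1.0

def ks_statistic(y_true, scores):
    """Maximum distance between the score CDFs of defaulters and non-defaulters."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    bad = np.sort(scores[y_true == 1])
    good = np.sort(scores[y_true == 0])
    grid = np.unique(scores)
    cdf_bad = np.searchsorted(bad, grid, side="right") / len(bad)
    cdf_good = np.searchsorted(good, grid, side="right") / len(good)
    return float(np.max(np.abs(cdf_bad - cdf_good)))

def gini_uplift(y_true, champion_scores, challenger_scores, n_boot=1000, seed=42):
    """Bootstrap the challenger-minus-champion Gini difference on the shared
    validation population to judge whether the uplift is distinguishable from noise."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y_true)
    champ, chall = np.asarray(champion_scores), np.asarray(challenger_scores)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():
            continue  # resample drew a single class; AUC undefined, skip
        diffs.append(gini(y[idx], chall[idx]) - gini(y[idx], champ[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return {
        "uplift": gini(y, chall) - gini(y, champ),
        "ci_95": (float(lo), float(hi)),  # interval excluding 0 suggests significance
    }
```

Statistical significance alone is not the verdict: a tight confidence interval around a 1-point uplift would still fail the materiality test described above.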

Drift Detection

Monthly monitoring for all production models

Invoked when: monthly model monitoring data is available, or an intra-month trigger fires based on approval rate or NPA rate anomaly

  • Computes the Population Stability Index (PSI) comparing the current month's score distribution to the model's development population distribution — a PSI above 0.10 indicates a meaningful population shift; above 0.25 indicates the model may be operating substantially outside its development domain and its predictions cannot be trusted at face value (see the PSI sketch below).
  • Computes the Characteristic Stability Index (CSI) for each input feature in the model — identifying which specific features are drifting and by how much. A PSI signal without a CSI breakdown is incomplete: knowing that the population has shifted is less actionable than knowing that bureau score distribution has shifted right by 20 points and GST turnover has compressed by 15% — because those feature-level shifts point to different root causes and different remediation strategies.
  • Tracks outcome drift separately from population drift — comparing the model's predicted NPA rate by score band against the observed NPA rate on the cohorts that have matured. A model that correctly predicted a 3% NPA rate for band X at development but is now observing 5% NPA in band X is experiencing outcome drift, which is more serious than population drift alone because it means the model's rank-ordering of risk is deteriorating in practice.
  • Cannot determine why drift is occurring — whether it reflects a genuine change in borrower risk behaviour, a change in the origination channel mix, a change in macro conditions, or a change in LendingIQ's underwriting practice. It measures drift; identifying the cause requires the data science team and the CRO AI to investigate the underlying drivers.
Output: Monthly drift monitoring report — PSI by model and overall, CSI by feature with drifting features highlighted, outcome drift by score band (predicted vs observed NPA rate), drift severity classification (Stable / Monitor / Investigate / Materially Degraded), and a narrative explaining what the drift pattern means for the model's current reliability.
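
As a reference for how the drift numbers in this report are typically derived, here is a minimal sketch assuming baseline and current score distributions are available as counts over the same bins. The severity cut-offs below 0.25 follow the thresholds stated above; the "Materially Degraded" boundary of 0.40 is an illustrative assumption, not a policy value.

```python
import numpy as np

def psi(baseline_counts, current_counts, eps=1e-6):
    """Population Stability Index between the development (baseline) score
    distribution and the current month's distribution, binned identically.
    The same formula applied per input feature gives the CSI."""
    b = np.asarray(baseline_counts, dtype=float)
    c = np.asarray(current_counts, dtype=float)
    b = np.clip(b / b.sum(), eps, None)  # convert to proportions; eps guards empty bins
    c = np.clip(c / c.sum(), eps, None)
    return float(np.sum((c - b) * np.log(c / b)))

def drift_severity(psi_value):
    """Map PSI onto the report's severity scale. The 0.10 and 0.25 thresholds
    follow the text; the 0.40 boundary is an assumed illustration."""
    if psi_value < 0.10:
        return "Stable"
    if psi_value < 0.25:
        return "Monitor"
    if psi_value < 0.40:
        return "Investigate"
    return "Materially Degraded"
```

Outcome drift is tracked separately, by comparing predicted and observed NPA rates per score band on matured cohorts, so a low PSI does not by itself clear a model.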

Retraining Trigger

Triggered by drift severity or performance degradation

Invoked when: drift monitoring identifies a model as "Investigate" or "Materially Degraded," or the quarterly performance review shows sustained Gini degradation beyond the policy threshold

  • Reads the full drift and performance history for the model under review — how long the degradation has been developing, which metrics have crossed policy thresholds, whether the drift is accelerating or stabilising, and what the current magnitude of performance loss is — and structures a retraining case that documents why retraining is warranted, what data the retrained model should be trained on, and what the target performance improvement should be.
  • Assesses the urgency of the retraining decision: a model that has drifted gradually over 18 months and is still marginally above the minimum acceptable Gini threshold is a planned retraining case that can be scheduled into the next model development cycle. A model that has degraded sharply over 60 days, with PSI above 0.25 and observed NPA rates materially above predicted, is an urgent retraining case that may require a temporary mitigation — tightening the policy cut-off or adding a manual review layer — while retraining is completed (see the urgency sketch below).
  • Recommends the retraining data window: should the model be retrained on the most recent 12 months of data (capturing current population behaviour), a longer lookback (capturing multiple economic cycles for robustness), or a blended approach that weights recent data more heavily? The recommendation is based on the nature of the drift detected — population drift argues for retraining on current population data; outcome drift may require a longer lookback to capture the full default lifecycle.
  • Cannot initiate retraining, access training data, or build the retrained model. It structures the business case and the data specification for the retraining exercise. The data science team executes the retraining; the governance committee approves the deployment of the retrained model after the validation agent validates it against the new baseline.
Output: Retraining brief — performance and drift history summary, threshold breach documentation, urgency classification (Scheduled / Urgent / Emergency), recommended retraining data window and rationale, interim mitigation options if urgency is high, success criteria the retrained model must meet, and the governance approval chain required before the retrained model enters production.
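
A minimal sketch of how the urgency classification described above could be expressed as a rule; the numeric thresholds are illustrative assumptions drawn from the examples in the text, not LendingIQ policy values.

```python
def retraining_urgency(psi_value, gini_drop_points, degradation_window_days,
                       observed_to_predicted_npa_ratio):
    """Classify a retraining case as Scheduled / Urgent / Emergency.
    All thresholds are illustrative, not policy values."""
    sharp = degradation_window_days <= 90                    # e.g. degraded over ~60 days
    outside_domain = psi_value > 0.25                        # outside development domain
    outcome_miss = observed_to_predicted_npa_ratio >= 1.5    # e.g. 5% observed vs 3% predicted
    if sharp and outside_domain and outcome_miss:
        return "Emergency"   # interim mitigation (tighter cut-off, manual review) while retraining
    if outside_domain or (sharp and gini_drop_points >= 5):
        return "Urgent"
    return "Scheduled"       # fold into the next model development cycle
```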

Knowledge base

Model Registry & Documentation

Model cards, development methodology documents, training data descriptions, prior validation reports, and approved model versions for every model in the inventory. Retrieved via RAG — the validation standard is always assessed against the current policy.

Model Performance Database

Monthly Gini, KS-stat, PSI, CSI, approval rate, and NPA rate by score band and vintage for every production model. The longitudinal performance record that drift detection and retraining triggers are computed from.

Model Risk Policy (RAG)

LendingIQ's model risk policy — validation standards, documentation requirements, performance thresholds, drift trigger levels, and governance approval chain. The regulatory and internal standards the agent applies in every validation exercise.

Development Population Baseline

The score distribution, feature distributions, and performance metrics from the model's development and initial validation. The reference point against which all drift is measured. A model cannot be monitored for drift without a documented baseline.

Challenger Model Output Store

Score distributions and decision outputs from challenger models running in shadow mode. Used for champion vs challenger comparison without challenger models touching production decisions.

Model Validation Knowledge

Pre-training knowledge of credit model validation methodology, Basel model risk management principles, PSI/CSI interpretation, scorecard validation standards, and quantitative model governance frameworks up to knowledge cutoff.

Hard guardrails

Will not approve a model for production deployment. Validation produces a verdict (Approved / Conditional / Rejected); the governance committee's sign-off is the deployment gate. The validation agent is an input to the governance decision, not the decision itself.
Will not validate a model without a complete documentation package. Missing model card, undocumented training data, absent out-of-time test results — any of these stops the validation and produces a documentation gap report. The documentation requirement is not a checklist formality; it is the foundation of independent assessment.
Will not suspend a production model autonomously. Model suspension has immediate operational consequences — decisions default to manual review or a backup policy rule set. Suspension requires CRO AI escalation and human governance committee approval, not an automated agent action.
Will not initiate retraining or access the data science team's training infrastructure. It produces the retraining brief and the data specification. Retraining is executed by the data science team after the governance committee authorises it based on the validation agent's retraining brief.
Will not accept a validation assignment from the team whose model is under review. Independence is architectural — the invocation must originate from the model governance function, not the model development team. Any attempt to route validation through the development team is a governance structure violation, not a model risk tool configuration issue.

Known limitations

PSI and CSI are summary statistics — they detect that drift has occurred but cannot diagnose why. A PSI of 0.15 tells you the population has shifted meaningfully; it does not tell you whether that shift is driven by a change in origination channel mix, a macro shift in borrower quality, a deliberate credit policy tightening, or a data quality problem in the feature pipeline. All four look identical in PSI. Root cause analysis requires the data science team and the CRO AI to investigate the drivers.

Mitigation: Accompany every drift alert with a structured root cause investigation request to the data science team — a 3-day turnaround to explain the likely driver of the detected drift. Build this into the model governance process so drift alerts consistently produce root cause intelligence, not just remediation triggers.
Outcome drift detection requires mature cohorts. A credit scorecard model's performance can only be measured on cohorts that have seasoned long enough to reveal their actual default behaviour — typically 12–24 months for MSME loans. For models deployed on products with long average tenures, outcome drift will be detected late relative to when the degradation actually began. The model may have been deteriorating for 18 months before the cohorts mature enough to show it in the performance data.

Mitigation: Supplement outcome-based drift detection with leading indicators — approval rate by score band, early delinquency rates (30-day bounce rate by score band) for immature cohorts. These leading indicators provide a 6–12 month earlier signal than waiting for full cohort maturity, at the cost of higher uncertainty in the metric.
Challenger model comparison requires a shared, representative, and unbiased out-of-time validation population. If the validation population is drawn from the same time period as the champion's training data, or if it excludes segments that were declined by the champion (and therefore have no outcome data), the comparison will be biased in ways that neither the model development team nor the validation agent can fully correct for. The reject inference problem — estimating what would have happened to declined applicants — is a fundamental limitation of credit model validation.

Mitigation: Document the validation population construction methodology explicitly in the model risk policy. Define how reject inference is handled, what time windows are excluded from validation to prevent look-ahead bias, and what minimum segment representation is required for a validation population to be considered representative. These methodological decisions are governance choices, not statistical ones.
The validation agent interprets statistical outputs from models it reads about — it does not have direct access to model internals, training code, or feature engineering pipelines. A documentation package that accurately describes the intended model but silently differs from the model actually deployed would not be detected by this agent. Production model integrity — ensuring that the deployed model matches the validated model — requires a separate technical governance control in the MLOps pipeline.

Mitigation: Implement a model hash verification step in the deployment pipeline — the model binary that passes validation must be cryptographically verifiable as the same model that enters production. This technical control closes the gap between what the validation agent validated and what is actually running.
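
A minimal sketch of the hash verification control suggested above, assuming the validated model is a serialized artifact on disk; the function names and the deployment-gate wiring are illustrative assumptions, not an existing LendingIQ control.

```python
import hashlib

def artifact_sha256(path, chunk_size=1 << 20):
    """SHA-256 digest of a serialized model artifact, streamed in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_deployed_model(validated_digest, deployed_artifact_path):
    """Deployment gate: the deployed binary must hash to the digest recorded
    at validation sign-off, otherwise deployment is blocked."""
    if artifact_sha256(deployed_artifact_path) != validated_digest:
        raise RuntimeError("Deployed model does not match the validated artifact")
```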
The agent's model validation knowledge reflects published methodology up to its training cutoff. Emerging model validation standards — particularly for machine learning models, neural networks, and LLM-based credit assessment tools — evolve rapidly, and methodologies that postdate the training cutoff will not be known without being injected into context. Validating non-traditional model types (gradient boosting, neural network scorecards) requires the model risk team to supplement the agent's framework with current best-practice documentation for those specific model types.

Mitigation: Maintain a supplementary validation methodology corpus for non-traditional model types and update it annually as the field evolves. Inject this corpus alongside the model documentation when validating ML models. The agent's general validation framework provides the governance structure; the supplementary corpus provides the model-type-specific technical standards.
Agent Profile · Model Validation Agent AI · LendingIQ · Bengaluru
Last updated April 2026 · For internal use
