Use case #0002

Proxy Variable Detection: How Fair Lending AI Finds Hidden Bias in Pincodes

A credit model variable that is correlated with a protected characteristic is a proxy variable — and using a proxy variable in a credit model produces discrimination that is statistically indistinguishable from using the protected characteristic directly. The Fair Lending AI tests every model variable for proxy correlation monthly. The pincode is the most common offender — and the hardest to detect without systematic analysis.

Why Proxy Variables Are the Primary Source of Algorithmic Discrimination in India

Direct discrimination by protected characteristic is legally and morally unacceptable, and no regulated institution would consciously design it into a credit model. But proxy variables introduce discrimination through the back door — and they do so in ways that are genuinely difficult to detect without running the right tests.

India's social geography is particularly prone to proxy variable problems. Pincodes are not neutral zip codes — they carry significant social information. In Mumbai, certain pincodes are strongly associated with particular religious communities. In Chennai, pincodes correlate with caste geography. In Bengaluru, the distinction between established localities and newer development areas correlates strongly with migrant status and economic origin. A credit model that uses pincode as a variable — whether for property value adjustment, local employment stability, or historical loan performance — is potentially encoding these social correlations into its credit decisions.

The same logic applies to other seemingly neutral variables: employer type (correlates with caste and community in certain sectors), educational institution attended (correlates with socioeconomic origin), business sector (correlates with community in traditional trading communities), and even mobile network operator in some geographies (correlates with language and community). The Fair Lending AI tests all of them.

"A credit model that uses pincode as a variable may not intend to discriminate by religion. But if that pincode correlates at 0.74 with religious community membership, the model is effectively doing exactly that — with mathematical precision."

The Pincode Proxy Correlation Analysis

Proxy Correlation Analysis — Pincode vs Protected Characteristics
Mumbai Metropolitan Area · 847 Pincodes Analysed · Nov 2025
| Pincode / Area | Approval Rate in Model | Corr. with Religion (proxy) | Corr. with Income | Model Impact | Proxy Status |
|---|---|---|---|---|---|
| 400008 / Bhendi Bazaar | 41.2% | r = 0.89 | r = 0.62 | Pincode adds −18 score points vs. adjacent 400001 | High proxy risk |
| 400012 / Mahim | 44.8% | r = 0.74 | r = 0.58 | Pincode adds −14 score points vs. matched-income adjacent areas | High proxy risk |
| 400097 / Govandi | 51.3% | r = 0.52 | r = 0.71 | Lower approval primarily income-explained (r = 0.71 with income quartile) | Monitor — income may justify |
| 400050 / Bandra West | 72.1% | r = 0.18 | r = 0.84 | High approval rate primarily income-explained — low proxy correlation | Clean |
| 400070 / Kurla East | 54.2% | r = 0.23 | r = 0.68 | Below-average approval primarily income/employment mix — acceptable | Clean |

The Four Variable Categories the AI Tests for Proxy Correlation

Geographic Variables (Highest Risk Category)

Pincode, District, Branch, Property Location

Geographic variables are the most common proxy for religion, caste, and community in India because residential segregation along these lines is historically documented and statistically measurable. Any geographic variable — whether used for property market adjustment, employment stability estimation, or historical NPA rates — must be tested against religious and community composition data for the geography.

→ Remediation: Replace pincode with income-quartile of pincode + property market index (income-adjusted)

Employer and Sector Variables (Medium Risk)

Employer Name, Industry Sector, Business Type

Certain industry sectors and business types correlate with community in India — jewellery, textiles, and trading businesses are associated with specific communities in different geographies. A model that assigns different risk weights to business sector without correcting for this proxy correlation may be systematically penalising members of those communities.

→ Test: sector approval rates by community proxy — flag if r > 0.50 after income controls

Name-Derived Variables (Medium Risk)

Name Length, Script, or Pattern Features

Name-based features — including name length, script (Devanagari vs Tamil vs Arabic), or suffix patterns — are sometimes used in fraud detection or identity verification models. These features correlate directly with religion and community and should never appear in credit scoring models. The proxy test checks whether any engineered name feature has entered the credit model through the feature engineering pipeline.

→ Zero tolerance: name-derived features in credit models automatically flagged for removal
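The zero-tolerance rule lends itself to a mechanical scan of the feature-engineering pipeline. Everything below is illustrative: the banned-pattern list and feature names are invented, not the Fair Lending AI's actual schema.

```python
# Illustrative name-derived patterns; a real deployment would maintain this
# list as governed, versioned configuration.
BANNED_SUBSTRINGS = ("name_length", "name_script", "name_suffix", "name_token")

def flag_name_features(feature_names):
    """Return engineered features that must be removed before deployment."""
    return [f for f in feature_names
            if any(p in f for p in BANNED_SUBSTRINGS)]

pipeline_features = ["income_quartile", "name_script_devanagari",
                     "property_market_index", "emp_tenure_months"]
print(flag_name_features(pipeline_features))  # → ['name_script_devanagari']
```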

Education and Social Variables (Emerging Risk)

Educational Institution, Language, Vehicle Type

As alternative data sources expand, variables like educational institution attended, language of application, vehicle type (owned), or subscription services create new proxy risks. These correlate with socioeconomic origin, language community, and caste in ways that are not always obvious. The Fair Lending AI tests every new variable added to the model pipeline for proxy correlation before deployment.

→ Pre-deployment proxy test mandatory for all new variables — gate before model update
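A minimal sketch of such a pre-deployment gate, assuming the proxy test has already produced a correlation per protected characteristic for the candidate variable. The 0.70 threshold follows this article; the interface and names are hypothetical.

```python
FLAG_THRESHOLD = 0.70  # flag threshold quoted in the article

def passes_proxy_gate(proxy_correlations: dict) -> bool:
    """proxy_correlations maps protected characteristic -> r for the variable.

    The variable enters the production pipeline only if every proxy
    correlation stays under the flag threshold in absolute value.
    """
    return all(abs(r) < FLAG_THRESHOLD for r in proxy_correlations.values())

new_variable = {"religion": 0.31, "caste": 0.28, "community": 0.44}
print(passes_proxy_gate(new_variable))        # → True
print(passes_proxy_gate({"religion": 0.89}))  # → False
```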

The Corrective Action When a Proxy Is Found

Identifying a proxy variable is the beginning of the governance action, not the end. The Fair Lending AI's proxy finding triggers a structured assessment: what is the variable's predictive value for credit risk independent of the proxy correlation, and can the legitimate credit information it carries be captured by a non-discriminatory alternative?

For pincodes correlated with religious community, the answer is typically yes: the information the pincode carries about property market liquidity and local income levels can be captured by replacing the raw pincode with an income-quartile rank of the pincode combined with a property market index — preserving the legitimate predictive content while removing the proxy correlation. For a variable like employer name that carries almost no legitimate credit signal but a high proxy correlation, the answer is removal. For a variable that carries substantial legitimate signal and cannot be replaced, the answer is a disparity-monitoring overlay that flags for additional human review any decision in which the variable produced a disparate outcome.
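The pincode remediation described above can be sketched as a transform that maps each pincode to an income-quartile rank, discarding the pincode's identity (and with it the community correlation). The incomes below are invented; a production version would combine this rank with the property market index mentioned in the text.

```python
def income_quartile(pincode_income: dict) -> dict:
    """Map each pincode to its income quartile (1 = lowest, 4 = highest)."""
    ranked = sorted(pincode_income, key=pincode_income.get)
    n = len(ranked)
    return {p: 1 + (i * 4) // n for i, p in enumerate(ranked)}

# Invented median household incomes (INR/month) per pincode.
median_income = {"400008": 32_000, "400012": 41_000, "400097": 28_000,
                 "400050": 95_000, "400070": 38_000, "400001": 61_000,
                 "400051": 88_000, "400011": 35_000}

q = income_quartile(median_income)
print(q["400008"], q["400050"])  # → 1 4
```

The model then sees "quartile 1" rather than "400008", so two pincodes with similar income profiles but different community compositions become indistinguishable to it.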

847: pincodes analysed for proxy correlation in the Mumbai metro area alone
r = 0.89: highest pincode-religion proxy correlation detected, well above the 0.70 flag threshold
4: variable categories tested (geographic, sector, name-derived, and alternative data)
Pre-deploy: every new variable tested for proxy correlation before it enters any production model

The Proxy Variable Is Not Evidence of Intent — But It Is Evidence of Effect

No lending institution intends to discriminate by religion when it uses pincode as a credit variable. But if that pincode is associated with a religious community at a correlation of 0.89, the model's use of pincode is producing discrimination that is functionally identical to using religion directly — with the additional problem that it is invisible without the proxy analysis. The Fair Lending AI runs the proxy analysis every month for every variable in every model. The institution that can show this analysis, its findings, and its remediation actions is an institution that takes fair lending seriously as a practice rather than a declaration.
