Today, when a customer applies for a loan, a loan officer works with the client to gather documents such as financial statements (typically audited), business plans, financial projections, invoices, and more. They then analyze them, calculate ratios, and start checking off boxes in their risk acceptance criteria spreadsheets. However, a significant portion of this process is subjective. Are their projections realistic? Does the business owner have “good character”? This subjectivity can lead to inconsistencies and potential biases.
This process can take hours, days, or even weeks. Even then, confidence is still low, so collateral is required. Not only is this process expensive (after all, time is money), but it also excludes most of the market. You cannot go through this whole process for a 10,000 Cedi, 1M Naira, or $1,000 loan.
Fortunately, the evolution of data science and technology, coupled with the widespread use of bank accounts and mobile payments, has paved the way for a new era in credit scoring, particularly in emerging markets. Innovations like Oze harness the power of machine learning algorithms to accurately predict credit risk. Let’s delve into the mechanics of this transformative process.
Let’s start with the basics. There are two categories of data: variables and labels. Or even simpler, inputs and outputs. In the case of credit scoring, inputs can come from bank statements, Oze account data (how often do they input sales?), account information (like account opening date), demographic information (age), loan information (what month they are applying in), and other data (for example, do they have a WhatsApp Business account?). Because the algorithm is doing the analysis, we can consider many variables per source. On a 90-day bank statement, Oze takes over 125 variables, such as the minimum credit on the account. We then need to look at the label (the output). For a past loan, the output is one of two outcomes: default, or closed (paid off).
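To make the idea of inputs and outputs concrete, here is a minimal sketch in Python. The field names and values are hypothetical and chosen only to mirror the data sources listed above; they are not Oze's actual variables, and a real model would use far more of them.

```python
# Hypothetical input variables, one per data source mentioned above.
FEATURE_NAMES = [
    "min_credit_90d",          # bank statement
    "sales_entries_per_week",  # Oze account data
    "account_age_days",        # account information
    "owner_age",               # demographic information
    "application_month",       # loan information (1-12)
    "has_whatsapp_business",   # other data (True/False)
]


def to_features(record):
    """Turn one loan record (a dict) into an ordered list of numeric inputs."""
    return [float(record[name]) for name in FEATURE_NAMES]


def to_label(record):
    """Map the loan outcome to the output: 1 = default, 0 = closed (paid off)."""
    outcomes = {"default": 1, "closed": 0}
    return outcomes[record["outcome"]]


# An illustrative past loan (made-up numbers).
past_loan = {
    "min_credit_90d": 40.0,
    "sales_entries_per_week": 5,
    "account_age_days": 730,
    "owner_age": 34,
    "application_month": 11,
    "has_whatsapp_business": True,
    "outcome": "closed",
}

inputs = to_features(past_loan)
output = to_label(past_loan)
```

Past loans supply both the inputs and the output, which is what the model learns from. A new application supplies only the inputs.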
Oze’s model takes fresh variables from a new loan application, runs them through the algorithm and makes a prediction on the output — how likely is this customer to default on this loan?
In preparing to build a model, Oze undertakes two key activities: data cleaning and feature engineering. First, the data is cleaned and preprocessed to deal with inconsistencies, missing values, and noise, ensuring the model has high-quality data to work with. Then our data scientists create new features from the raw data that can enhance the model’s predictive power, such as aggregated transaction patterns or time-based features. Once that’s done, we are ready to build the model.
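The two steps can be sketched roughly as follows. This is an illustration under simple assumptions, not Oze's pipeline: transactions are plain dicts with hypothetical `date`, `amount`, and `ref` keys, credits are positive amounts, and debits are negative amounts.

```python
def clean(transactions):
    """Drop rows with missing values and exact duplicates."""
    seen = set()
    cleaned = []
    for t in transactions:
        if t.get("amount") is None or t.get("date") is None:
            continue  # missing value
        key = (t["date"], t["amount"], t.get("ref"))
        if key in seen:
            continue  # duplicate row
        seen.add(key)
        cleaned.append(t)
    return cleaned


def engineer_features(transactions):
    """Aggregate raw transactions into a few model inputs."""
    credits = [t["amount"] for t in transactions if t["amount"] > 0]
    debits = [-t["amount"] for t in transactions if t["amount"] < 0]
    return {
        "min_credit": min(credits) if credits else 0.0,
        "total_credits": sum(credits),
        "total_debits": sum(debits),
        "credit_count": len(credits),
        "active_days": len({t["date"] for t in transactions}),
    }


# Made-up statement lines, including one duplicate and one missing amount.
raw = [
    {"date": "2024-01-02", "amount": 500.0, "ref": "A1"},
    {"date": "2024-01-02", "amount": 500.0, "ref": "A1"},
    {"date": "2024-01-03", "amount": None, "ref": "A2"},
    {"date": "2024-01-05", "amount": -120.0, "ref": "A3"},
    {"date": "2024-01-09", "amount": 80.0, "ref": "A4"},
]

features = engineer_features(clean(raw))
```

Each key in `features` is one engineered variable. The "minimum credit on the account" mentioned earlier corresponds to `min_credit` here.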
As for techniques, Oze combines two methods that are both built from decision trees: random forests and gradient boosting. Let’s break that down.
A random forest is a collection of decision trees.
A decision tree is like a flowchart, where each junction represents a question about the data and each branch represents an answer to that question. For example, a node might ask, “How long has the customer had an account at this bank?” Based on the answer (less than 1 year, 1-5 years, 5-10 years, etc.), it follows a different path.
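Written as code, a small decision tree is just a set of nested questions. The sketch below is hand-written for illustration: the field names and thresholds are made up, whereas a real tree learns its questions and cut-offs from past loans.

```python
def tree_predict(applicant):
    """Follow one path through a tiny, hand-written decision tree."""
    years = applicant["years_with_bank"]
    if years < 1:
        # New accounts: look at the smallest credit on the statement.
        if applicant["min_credit_90d"] < 100:
            return "default"
        return "closed"
    if years < 5:
        # Established accounts: look at how often sales are recorded.
        if applicant["sales_entries_per_week"] < 1:
            return "default"
        return "closed"
    # Long-standing accounts end in a single leaf in this toy tree.
    return "closed"


new_customer = {
    "years_with_bank": 0.5,
    "min_credit_90d": 20,
    "sales_entries_per_week": 3,
}
prediction = tree_predict(new_customer)
```

Each `if` is a junction, each outcome of the test is a branch, and each `return` is a leaf where the tree commits to a prediction.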
Each tree in the forest is trained on a different random subset of the data and makes its own prediction (i.e., will this customer default?). The final prediction is made by averaging the predictions of all the trees or by taking a majority vote.
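Here is a minimal sketch of that idea, using only the Python standard library. It makes two simplifying assumptions: each "tree" is a one-question tree (a stump), and the random subsets are bootstrap samples drawn with replacement. Production random forests also grow deeper trees and randomly restrict which variables each split may use. The data is made up.

```python
import random


def fit_stump(rows, labels):
    """Pick the single feature/threshold question with the fewest errors."""
    best = None
    for f in range(len(rows[0])):
        for thr in sorted({r[f] for r in rows}):
            for left_label in (0, 1):
                errors = sum(
                    (left_label if r[f] <= thr else 1 - left_label) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, thr, left_label)
    _, f, thr, left_label = best
    return lambda r: left_label if r[f] <= thr else 1 - left_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest


def forest_predict_proba(forest, row):
    """Average the trees' votes: the share of trees predicting default."""
    return sum(tree(row) for tree in forest) / len(forest)


def forest_predict(forest, row):
    """Majority vote: 1 = default, 0 = closed."""
    return 1 if forest_predict_proba(forest, row) >= 0.5 else 0


# Toy data: one input (years with the bank); 1 = default, 0 = closed.
rows = [[0.5], [0.8], [1.0], [4.0], [6.0], [8.0]]
labels = [1, 1, 1, 0, 0, 0]
forest = fit_forest(rows, labels)
```

Because every tree sees a slightly different sample, an odd tree can be wrong without changing the overall vote, which is where the stability of the forest comes from.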
Gradient Boosting builds decision trees one at a time, where each new tree corrects errors made by the previous trees. It provides high accuracy by focusing on difficult cases and refining predictions incrementally.
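The "correct the previous errors" loop can be sketched in a few lines. This is a simplified illustration, not Oze's implementation: it uses one input variable, one-question trees, and squared error, whereas gradient boosting for default prediction would normally use a classification loss. The data and parameter values are made up.

```python
def fit_regression_stump(xs, residuals):
    """One-question tree that predicts the mean residual on each side."""
    best = None
    for thr in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    _, thr, left_mean, right_mean = best
    return lambda x: left_mean if x <= thr else right_mean


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.1):
    """Add trees one at a time, each fitted to the current errors."""
    base = sum(ys) / len(ys)  # start from the average outcome
    preds = [base] * len(xs)
    trees = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still wrong
        tree = fit_regression_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(t(x) for t in trees)

    return predict


# Toy data: years with the bank -> 1 = default, 0 = closed.
xs = [0.5, 1.0, 2.0, 4.0, 6.0, 8.0]
ys = [1, 1, 1, 0, 0, 0]
predict = fit_boosted(xs, ys)
one_round = fit_boosted(xs, ys, n_rounds=1)
```

After a single round the prediction has only moved a small step away from the average. Each further round removes part of the remaining error, which is the incremental refinement described above.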
Random Forests are great at providing stable and robust predictions by averaging the results of many trees. Gradient Boosting, on the other hand, fine-tunes predictions by focusing on errors made by previous trees. The upside of using these techniques together is that we can make highly accurate predictions even when data is scarce. The downside is that these models are considered “black box”.
A “black box” model is a machine learning or artificial intelligence model whose internal workings are not easily understood or interpreted by humans. The term is often used to describe complex models, such as deep neural networks, where the decision-making process is hidden behind layers of calculations and interactions between variables. “Black box” models often offer better predictive accuracy than simpler, more interpretable models (like linear regression), but the trade-off is that they sacrifice transparency. This often makes bankers hesitant to use them. That hesitation comes at a cost, because more accurate predictions lead to better loan approval decisions, which reduces defaults and increases profitability.
Additionally, machine learning algorithms can help minimize human biases by focusing purely on data-driven insights, ensuring fairer decisions across diverse applicant pools. Despite the “black box” nature, these models can be designed to comply with regulatory requirements, with mechanisms for monitoring and auditing decisions. With the proper training, relationship managers can manage customer communications and support rejected borrowers so that they can apply again in the future with better odds of success. The benefits of these models far outweigh the costs.
Oze’s model demonstrates the strong performance possible using machine learning algorithms for credit scoring. The AUC (Area Under the ROC Curve) of the model is 91%. The AUC measures how well a model separates different outcomes, in this case customers who default on a loan from customers who repay. An AUC of 91% means that if we pick one defaulted loan and one repaid loan at random, the model gives the higher risk score to the defaulted loan 91% of the time. It does not mean that 91% of individual predictions are correct. For reference, an AUC of 50% is no better than a coin flip. As a rule of thumb, an AUC is considered “good” at 70%, very good at 80% or above, and excellent at 90% or above.
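One way to see what the AUC measures is to compute it by hand: compare every defaulted loan with every repaid loan and count how often the defaulted one received the higher risk score. The sketch below does exactly that on a handful of made-up scores; it illustrates the metric and is not Oze's evaluation code.

```python
def auc(labels, scores):
    """Share of (default, repaid) pairs where the default has the higher score."""
    defaulted = [s for y, s in zip(labels, scores) if y == 1]
    repaid = [s for y, s in zip(labels, scores) if y == 0]
    if not defaulted or not repaid:
        raise ValueError("AUC needs at least one loan of each outcome")
    wins = 0.0
    for d in defaulted:
        for r in repaid:
            if d > r:
                wins += 1.0
            elif d == r:
                wins += 0.5  # ties count as half
    return wins / (len(defaulted) * len(repaid))


# Made-up example: 1 = default, 0 = repaid, with the model's risk scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
result = auc(labels, scores)  # 5 of the 6 pairs are ranked correctly
```

In this toy example the model ranks 5 of the 6 pairs correctly, so the AUC is about 83%. The only miss is the defaulted loan scored 0.4, which sits below the repaid loan scored 0.5.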
Once we are all happy with the model, it is deployed into production systems to make real-time credit-scoring decisions. This is only the beginning. Oze monitors the model’s performance over time, ensuring it remains accurate as market conditions and customer behaviors change. Our models are retrained after every 3000-5000 loans to improve their predictive power as more data becomes available.
The traditional loan application process, reliant on subjective evaluations by loan officers, is time-consuming, costly, and prone to inconsistencies and biases, and it is too inefficient to serve small loans. However, advancements in data science and technology have enabled the development of innovative credit scoring systems like Oze, which use machine learning algorithms to predict credit risk accurately. By analyzing a wide range of data inputs, such as bank statements and demographic information, Oze’s model predicts how likely a loan is to default. The model employs a combination of random forests and gradient boosting, offering high accuracy even with limited data. While these “black box” models are complex and less transparent, they can help reduce bias, improve decision-making, and enhance profitability.
Oze’s model, with an impressive AUC of 91%, demonstrates the strong performance possible using machine learning algorithms for credit scoring. With proper oversight and regulatory compliance, these models have the potential to transform credit scoring, especially in emerging markets.