INTRODUCTION
Credit scoring is used by banks and other financial institutions to evaluate the risk associated with lending. This project trains a model that classifies applicants as either creditworthy or at high risk of defaulting. The data set comes from a housing finance company in India and has 981 observations of 13 variables: Loan ID, Gender, Married, Dependents, Education, Self Employed, Applicant Income, Co-applicant Income, Loan Amount, Loan Amount Term, Credit History and Property Area, with Loan Status as the target variable.
DATA EXPLORATION
Data exploration enables a data scientist to understand the data before feeding it to a model; it is where outliers and missing data are identified and dealt with. This step is key to building a good model: even a good model applied to poor-quality data will produce undesirable results (GIGO, Garbage In Garbage Out). For numerical variables, median imputation was applied. For categorical variables, either a new level was created for missing values or mode imputation was used. The data was split into two sets: a train set used to train the model, and a test set without the target variable for validating the model.
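As a minimal sketch of this preprocessing (using made-up values and assumed column names; the full script is linked at the end of the post):

```r
# Tiny stand-in data frame with missing values in a numeric
# and a categorical column (column names are assumptions).
set.seed(42)
loans <- data.frame(
  LoanAmount    = c(120, NA, 150, 200, NA, 95),
  Self_Employed = c("No", "Yes", NA, "No", "No", NA),
  stringsAsFactors = FALSE
)

# Median imputation for the numeric variable
loans$LoanAmount[is.na(loans$LoanAmount)] <-
  median(loans$LoanAmount, na.rm = TRUE)

# Mode imputation for the categorical variable
mode_val <- names(which.max(table(loans$Self_Employed)))
loans$Self_Employed[is.na(loans$Self_Employed)] <- mode_val

# Illustrative 70/30 train/test split
idx   <- sample(nrow(loans), size = floor(0.7 * nrow(loans)))
train <- loans[idx, ]
test  <- loans[-idx, ]
```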
VISUALIZATIONS
R statistical software was the tool used for this project. The ggplot2 package was used for visualization. Below is a summary of the insights obtained from visualization of the train set.
Loan Status
The train set had 614 observations. 192 (31%) of them were considered risky and their applications denied. 422 (69%) had their applications approved.
Married
From the bar chart it is evident that unmarried applicants had a higher chance of having their applications rejected.
Education
39% of non-graduates had their applications rejected, compared to 29% of graduates.
Applicant Income
The distribution of income was similar for rejected and accepted applications, and was positively skewed, as is typical of income distributions.
The table below shows a summary of applicant income for both classes of applicants:

Loan_Status    Min.   1st Qu.   Median    Mean   3rd Qu.    Max.
N               150      2885     3834    5446      5861   81000
Y               210      2878     3812    5384      5772   63340
Co-applicant Income
From the box plots it is evident that applicants whose applications were accepted had co-applicants with higher median income than their counterparts.
The table below is a summary of co-applicant income for both classes of applicants:

Loan_Status    Min.   1st Qu.   Median    Mean   3rd Qu.    Max.
N                 0         0      268    1878      2274   41670
Y                 0         0     1240    1505      2297   20000
Credit History
The table below is a cross tabulation of credit history and loan status (row proportions):

Credit_History       N       Y
0                0.921   0.079
1                0.204   0.796
92% of applicants whose credit history did not meet guidelines had their applications rejected, while 80% of those whose credit history met guidelines were approved.
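A cross tabulation like this can be produced with `prop.table()`; the sketch below uses synthetic stand-in data rather than the actual train set:

```r
# Synthetic credit-history and loan-status vectors that roughly
# mimic the pattern in the real data.
set.seed(1)
Credit_History <- sample(c(0, 1), 100, replace = TRUE, prob = c(0.2, 0.8))
Loan_Status <- ifelse(Credit_History == 1,
                      sample(c("N", "Y"), 100, TRUE, c(0.2, 0.8)),
                      sample(c("N", "Y"), 100, TRUE, c(0.9, 0.1)))

# margin = 1 gives row proportions: P(Loan_Status | Credit_History)
tab <- prop.table(table(Credit_History, Loan_Status), margin = 1)
tab
```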
Property Area
The semi-urban population had the highest approval rates as compared to the other two populations.
FEATURE ENGINEERING
Three new features were engineered from the existing variables:
- Total Income = Applicant Income + Co-applicant Income
- Installments = Loan Amount / Loan Amount Term
- Installment_Income_Ratio = Installments / Applicant Income
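The three features can be computed directly; the sketch below assumes column names along the lines of the competition data:

```r
# Stand-in data frame; real column names may differ slightly.
train <- data.frame(
  ApplicantIncome   = c(5000, 3000),
  CoapplicantIncome = c(1500, 0),
  LoanAmount        = c(120, 90),
  Loan_Amount_Term  = c(360, 180)
)

train$Total_Income <- train$ApplicantIncome + train$CoapplicantIncome
train$Installments <- train$LoanAmount / train$Loan_Amount_Term
train$Installment_Income_Ratio <- train$Installments / train$ApplicantIncome
```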
MODELLING
Three classification techniques were used, namely:
- Logistic Regression (step-wise backward elimination)
- Decision trees
- Random Forests
Step-Wise Regression
Step-wise logistic regression was applied to the train set. The model found the following variables most predictive:
- Installment_Income_Ratio
- Married
- Property_Area
- Credit_History
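A hedged sketch of backward stepwise elimination with `glm()` and `step()`, on synthetic data (the actual model used the loan variables above):

```r
# Synthetic data with one informative predictor and one noise variable.
set.seed(7)
n <- 200
d <- data.frame(
  Credit_History = rbinom(n, 1, 0.8),
  Married        = factor(sample(c("Yes", "No"), n, TRUE)),
  Noise          = rnorm(n)
)
d$Loan_Status <- rbinom(n, 1, plogis(-1 + 2 * d$Credit_History))

# Fit the full logistic model, then eliminate variables backwards by AIC
full    <- glm(Loan_Status ~ ., data = d, family = binomial)
reduced <- step(full, direction = "backward", trace = 0)

# Classify with a 0.5 probability cutoff
predicted <- ifelse(predict(reduced, type = "response") > 0.5, "Y", "N")
```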
The model was used to make predictions for the test set data comprising 367 observations.
This model predicted 16% of the test set observations would be denied and 84% would have their applications accepted.
This model had an accuracy rate of 0.7847.
Decision Trees
A decision tree algorithm from the R package rpart was used to train on the data and predict loan status for the test set. After training, the tree was pruned to avoid overfitting and thereby minimize classification error. A cost matrix was also used that penalized accepting a defaulter five times more heavily than rejecting a creditworthy customer.
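The cost matrix and pruning might look like this in rpart; the data here is synthetic and the parameter values are assumptions, not the project's exact settings:

```r
library(rpart)

# Synthetic stand-in for the loan data.
set.seed(3)
n <- 300
d <- data.frame(
  Credit_History = rbinom(n, 1, 0.8),
  Income         = rlnorm(n, 8, 0.5)
)
d$Loan_Status <- factor(ifelse(d$Credit_History == 1 & runif(n) < 0.8,
                               "Y", "N"))

# Loss matrix (rows = true class N/Y, cols = predicted): misclassifying
# a true "N" (defaulter) as "Y" costs 5, the reverse mistake costs 1.
fit <- rpart(Loan_Status ~ ., data = d, method = "class",
             parms = list(loss = matrix(c(0, 1, 5, 0), nrow = 2)))

# Prune at the cp value minimizing cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```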
On the left is the decision tree before it was pruned.
The decision tree had an accuracy rate of 0.7778.
Random Forest
The randomForest package was loaded and the model trained on the train set. The model ranked the variables by predictive strength as shown below.
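A sketch of the randomForest fit and variable-importance ranking, again on synthetic stand-in data rather than the actual loan variables:

```r
library(randomForest)

# Synthetic data standing in for the loan variables.
set.seed(9)
n <- 300
d <- data.frame(
  Credit_History = factor(rbinom(n, 1, 0.8)),
  Income         = rlnorm(n, 8, 0.5)
)
d$Loan_Status <- factor(ifelse(d$Credit_History == "1" & runif(n) < 0.8,
                               "Y", "N"))

# importance = TRUE records per-variable importance measures
fit <- randomForest(Loan_Status ~ ., data = d,
                    ntree = 200, importance = TRUE)

importance(fit)  # mean decrease in accuracy / Gini per variable
```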
The random forest model had an accuracy rate of 0.7986.
CONCLUSION
Of the three models, the random forest performed best. Most financial institutions prefer decision trees or logistic regression since they are simpler to interpret; random forest, though more powerful, ranks lower as a choice because it is a black-box method.
Credit scoring is a very useful risk adjudication solution that can be used to assess creditworthiness of loan applicants for lending institutions.
The complete R script can be found on my GitHub page: https://github.com/chrisliti/Credit-Scoring.