Saturday, 27 August 2016

INTRODUCTION

Credit scoring is used by banks and other financial institutions to evaluate the potential risk associated with lending. This project attempts to train a model that classifies applicants as either creditworthy or at high risk of defaulting. The data set used comes from a housing finance company in India and has 981 observations with 13 variables: Loan ID, Gender, Married, Dependents, Education, Self Employed, Applicant Income, Co-applicant Income, Loan Amount, Loan Amount Term, Credit History, Property Area and the target variable, Loan Status.

DATA EXPLORATION
Data exploration enables a data scientist to understand the data better before feeding it to a model; it is also the stage at which outliers and missing data are dealt with. Data exploration is key to building a good model: even a good model applied to data of poor quality will produce undesirable results, GIGO (Garbage In Garbage Out). For numerical variables, median imputation was applied. For categorical variables, either a new level was created for missing values or mode imputation was used. The data was split into two sets: a train set used to train the model and a test set, without the target variable, for validating the model.
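A minimal sketch of these cleaning steps, on a toy data frame (column names follow the loan data set; the exact imputation code in the original script may differ):

```r
# Toy data frame standing in for the train set
train <- data.frame(
  LoanAmount    = c(100, NA, 150, 120),
  Self_Employed = c("No", NA, "Yes", "No"),
  stringsAsFactors = FALSE
)

# Median imputation for a numeric variable
train$LoanAmount[is.na(train$LoanAmount)] <-
  median(train$LoanAmount, na.rm = TRUE)

# Mode imputation for a categorical variable:
# the most frequent observed level replaces NAs
mode_val <- names(which.max(table(train$Self_Employed)))
train$Self_Employed[is.na(train$Self_Employed)] <- mode_val
```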

VISUALIZATIONS
R statistical software was the tool used for this project. The ggplot2 package was used for visualization. Below is a summary of the insights obtained from visualization of the train set.

Loan Status

The train set had 614 observations. 192 of them (31%) were considered risky and had their applications denied; 422 (69%) had their applications approved.

Married


From the bar chart it is evident that unmarried applicants had a higher chance of having their applications denied.
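A sketch of how such a chart can be drawn with ggplot2, on toy data (the original script's aesthetics may differ; the requireNamespace guard keeps the snippet runnable even where ggplot2 is not installed):

```r
toy <- data.frame(
  Married     = c("Yes", "Yes", "Yes", "Yes", "No", "No"),
  Loan_Status = c("Y",   "Y",   "Y",   "N",   "Y",  "N")
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Stacked proportions make the approval rates directly comparable
  p <- ggplot(toy, aes(x = Married, fill = Loan_Status)) +
    geom_bar(position = "fill") +
    labs(y = "Proportion of applications")
}
```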

Education

39% of non-graduates had their applications rejected, compared with 29% of graduates.


Applicant Income

The distribution of applicant income was similar for approved and rejected applications, and was positively skewed, as is typical of income distributions.

The table below summarises applicant income for both classes of applicants.

train$Loan_Status: N
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    150    2885    3834    5446    5861   81000
------------------------------------------------------------------------------
train$Loan_Status: Y
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    210    2878    3812    5384    5772   63340
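Output in this shape is what base R's by() produces when it applies summary() within each class; a toy example with made-up incomes:

```r
toy <- data.frame(
  ApplicantIncome = c(150, 2885, 3834, 5861, 81000,
                      210, 2878, 3812, 5772, 63340),
  Loan_Status     = rep(c("N", "Y"), each = 5)
)

# summary() of income within each level of Loan_Status
by(toy$ApplicantIncome, toy$Loan_Status, summary)
```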


Co-applicant Income

From the box plots it is evident that applicants whose applications were accepted had co-applicants with higher median income than their counterparts.

The table below is a summary of the co-applicant income variable.

train$Loan_Status: N
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0     268    1878    2274   41670
------------------------------------------------------------------------------
train$Loan_Status: Y
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0    1240    1505    2297   20000

Credit History

The table below is a cross tabulation of credit history and loan status (row proportions):

Credit_History            N            Y
             0   0.92134831   0.07865169
             1   0.20421053   0.79578947

92% of applicants whose credit history did not meet the guidelines had their applications rejected, while 79% of those whose credit history met the guidelines had their applications approved.
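Row proportions like these come from table() combined with prop.table() using margin = 1; a toy illustration:

```r
toy <- data.frame(
  Credit_History = c(0, 0, 0, 1, 1, 1, 1, 1),
  Loan_Status    = c("N", "N", "Y", "N", "Y", "Y", "Y", "Y")
)

# margin = 1 normalises each row (credit-history level) to sum to 1
prop.table(table(toy$Credit_History, toy$Loan_Status), margin = 1)
```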


Property Area

Applicants from semi-urban areas had the highest approval rate of the three property areas.

FEATURE ENGINEERING

Three new features were engineered from the existing variables:

  1. Total Income - Applicant Income + Co-applicant Income
  2. Installments - Loan Amount / Loan Amount Term
  3. Installment_Income_Ratio -  Installments / Applicant Income
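In R the three features can be derived directly; a sketch with toy values (in the original data set LoanAmount is recorded in thousands, an assumption not used by the arithmetic below):

```r
train <- data.frame(
  ApplicantIncome   = c(5000, 3000),
  CoapplicantIncome = c(1500, 0),
  LoanAmount        = c(120, 90),
  Loan_Amount_Term  = c(360, 360)
)

# The three engineered features, as listed above
train$Total_Income             <- train$ApplicantIncome + train$CoapplicantIncome
train$Installments             <- train$LoanAmount / train$Loan_Amount_Term
train$Installment_Income_Ratio <- train$Installments / train$ApplicantIncome
```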

MODELLING

Three classification techniques were used, namely:
  • Logistic Regression (step-wise backward elimination)
  • Decision trees
  • Random Forests
Step-Wise Regression

Step-wise logistic regression was applied to the train set. The model found the following variables to be the most predictive:

  • Installment_Income_Ratio  
  • Married                
  • Property_Area 
  • Credit_History            
The model was used to make predictions for the test set, which comprises 367 observations. It predicted that 16% of the test set observations would be denied and 84% would have their applications approved.

This model had an accuracy rate of 0.7847.
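A hedged sketch of backward elimination with glm() and step(), on simulated data standing in for the train set (variable names and coefficients here are made up for illustration):

```r
set.seed(1)
toy <- data.frame(
  Credit_History = rbinom(200, 1, 0.8),
  Married        = rbinom(200, 1, 0.6),
  Noise          = rnorm(200)
)
# Outcome driven mainly by credit history
toy$Loan_Status <- rbinom(200, 1, plogis(-1 + 2 * toy$Credit_History))

full  <- glm(Loan_Status ~ ., data = toy, family = binomial)
# Backward elimination by AIC; trace = 0 suppresses the step log
model <- step(full, direction = "backward", trace = 0)

# Class predictions at a 0.5 probability threshold
pred <- as.integer(predict(model, type = "response") > 0.5)
```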

Decision Trees

A decision tree algorithm from the R package rpart was used to train the model and predict loan status for the test set. After training, the tree was pruned to avoid overfitting and hence minimise classification error. A cost matrix was also used that penalised accepting a defaulter five times more than rejecting a creditworthy customer.
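A sketch of an rpart tree with such a cost matrix, followed by cost-complexity pruning (toy data; the 5:1 loss matrix assumes the classes are ordered N, Y with rows as the true class):

```r
library(rpart)

set.seed(2)
toy <- data.frame(
  Credit_History = factor(rbinom(300, 1, 0.8)),
  Income         = rlnorm(300, meanlog = 8, sdlog = 0.6)
)
toy$Loan_Status <- factor(ifelse(toy$Credit_History == 1 & runif(300) < 0.8,
                                 "Y", "N"))

# loss[1, 2] = 5: predicting Y for a true N (accepting a defaulter)
# costs five times loss[2, 1] (rejecting a creditworthy customer)
fit <- rpart(Loan_Status ~ ., data = toy, method = "class",
             parms = list(loss = matrix(c(0, 1, 5, 0), nrow = 2)))

# Prune at the complexity parameter with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```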

On the left is the decision tree before it was pruned.

The pruned decision tree had an accuracy rate of 0.7778.

Random Forest
The randomForest package was used to fit a random forest on the train set. The model ranked the variables according to their predictive strength, as shown below.
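A hedged sketch of fitting the forest and extracting variable importance (toy data; the requireNamespace guard keeps the snippet runnable where randomForest is not installed):

```r
set.seed(3)
toy <- data.frame(
  Credit_History = factor(rbinom(300, 1, 0.8)),
  Income         = rlnorm(300, meanlog = 8, sdlog = 0.6)
)
toy$Loan_Status <- factor(ifelse(toy$Credit_History == 1 & runif(300) < 0.8,
                                 "Y", "N"))

if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  rf <- randomForest(Loan_Status ~ ., data = toy, importance = TRUE)
  # Variables ranked by mean decrease in accuracy / Gini
  print(importance(rf))
}
```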

The random forest model had an accuracy rate of 0.7986.

CONCLUSION

Of the three models, the random forest performed best. Most financial institutions nevertheless prefer decision trees or logistic regression since they are simpler to interpret; random forests are more powerful but rank lower as a choice because they are black-box methods.

Credit scoring is a very useful risk-adjudication solution that lending institutions can use to assess the creditworthiness of loan applicants.

The complete R script can be found on my GitHub page: https://github.com/chrisliti/Credit-Scoring.

