Introduction
I recently participated in a hackathon hosted by www.analyticsvidhya.com and finished 6th on the private leaderboard among a very competitive crowd. The task is an excellent example of the “Predictive Lead Scoring” problem that businesses face in many sectors, including Banking, Insurance, Financial Services, Retail, Manufacturing and FMCG. I therefore decided to cover my approach to this business problem in this post.
Problem Description
Customer Bank is a mid-sized private bank that deals in all kinds of loans and has branches across all major cities in the country.
The digital arms of banks today face challenges with lead conversion. They source leads through channels such as search, display and email campaigns, and via affiliate partners. Customer Bank faces the same challenge of a low conversion ratio.
The challenge is to identify the customer segments with a higher propensity to opt for a specific loan product. To maximize ROI on its marketing spend, the bank needs to target only those customers who are most likely to take up a loan product.
Data
The data contains customer details based on the last 3 months of transactions. We were charged with identifying the segments of customers with a higher disbursal rate in the next 30 days.
Input variables:
ID - Unique ID
Gender - Sex
City - Current City
Monthly_Income - Monthly Income in rupees
DOB - Date of Birth
Lead_Creation_Date - Lead Created on date
Loan_Amount_Applied - Loan Amount Requested
Loan_Tenure_Applied - Loan Tenure Requested (in years)
Existing_EMI - EMI of Existing Loans
Employer_Name - Employer Name
Salary_Account - Salary account with Bank
Mobile_Verified - Mobile Verified (Y/N)
Var5 - Continuous classified variable
Var1 - Categorical variable with multiple levels
Loan_Amount_Submitted - Loan Amount Revised and Selected after seeing Eligibility
Loan_Tenure_Submitted - Loan Tenure Revised and Selected after seeing Eligibility (in years)
Interest_Rate - Interest Rate of Submitted Loan Amount
Processing_Fee - Processing Fee of Submitted Loan Amount
EMI_Loan_Submitted - EMI of Submitted Loan Amount
Filled_Form - Filled Application form post quote
Device_Type - Device from which application was made (Browser/Mobile)
Var2 - Categorical variable with multiple levels
Source - Categorical variable with multiple levels
Var4 - Categorical variable with multiple levels
Target Variables:
LoggedIn - Application Logged
Disbursed - Loan Disbursed
My Approach:
I applied the “Predictive Lead Scoring” methodology discussed in my earlier blog posts to this problem, using “R”, the lingua franca of the “Predictive Analytics” applications world.
Step 1: Feature Engineering
- converted DOB into Age in years.
- converted character categorical variables into numeric levels.
- imputed missing values and NAs with defaults.
- grouped “outliers” into a single category, so that they won't skew the predictions.
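The four steps above can be sketched roughly as follows. The original work was done in R; this is a small Python illustration, not the code I actually ran. It assumes dates arrive as ISO strings (YYYY-MM-DD), that age is measured at Lead_Creation_Date, that missing values appear as None, and that “outliers” in a categorical column means levels seen fewer than min_count times. The helper names, the default of 0 and the toy rows are all made up for the example.

```python
from collections import Counter
from datetime import date


def age_in_years(dob, on_date):
    """Whole years elapsed between dob and on_date."""
    had_birthday = (on_date.month, on_date.day) >= (dob.month, dob.day)
    return on_date.year - dob.year - (0 if had_birthday else 1)


def fill_missing(values, default):
    """Replace None (standing in for NA) with a fixed default."""
    return [default if v is None else v for v in values]


def group_rare_levels(values, min_count, other="Other"):
    """Collapse levels seen fewer than min_count times into one bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def encode_levels(values):
    """Map each distinct level to an integer code, in sorted order."""
    codes = {level: code for code, level in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


# Toy columns, invented for illustration only.
dob = ["1985-07-15", "1990-01-02", "1978-11-30"]
lead_date = ["2015-07-14", "2015-07-20", "2015-08-01"]
city = ["Delhi", "Delhi", "Pune"]
income = [30000, None, 45000]

age = [
    age_in_years(date.fromisoformat(d), date.fromisoformat(l))
    for d, l in zip(dob, lead_date)
]
income_filled = fill_missing(income, default=0)       # default of 0 is an assumption
city_grouped = group_rare_levels(city, min_count=2)   # rare cities become "Other"
city_codes, city_mapping = encode_levels(city_grouped)
```

Grouping rare levels before encoding keeps the number of numeric codes small, which is the point of the last bullet: a city seen once or twice should not get its own level.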
Step 2: Exploratory Data Analysis
- identified the variables having a high correlation with the target variable “Disbursed”. These variables would be included in the Predictive Modeling process.
- dropped the variables having no meaningful correlation with the target variable “Disbursed”.
- normalized the variables so that they are all on a similar numerical scale.
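A rough Python sketch of this screening step is below (the original analysis was done in R). It assumes Pearson correlation against a 0/1 “Disbursed” column as the measure of association and min-max scaling as the normalization. The post does not pin down either choice, so treat both, along with the min_abs_corr cutoff and the toy columns, as illustrative assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, min_abs_corr):
    """Keep columns whose |correlation| with the target reaches the cutoff."""
    kept = {}
    for name, values in columns.items():
        r = pearson(values, target)
        if abs(r) >= min_abs_corr:
            kept[name] = r
    return kept


def min_max_scale(values):
    """Rescale a column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy columns, invented for illustration only.
columns = {"feature_a": [1, 2, 3, 4], "feature_b": [5, 5, 5, 5]}
disbursed = [0, 0, 1, 1]

kept = select_features(columns, disbursed, min_abs_corr=0.3)  # cutoff is hypothetical
scaled = {name: min_max_scale(columns[name]) for name in kept}
```

Pearson correlation only captures linear association, so a variable dropped by this screen could still matter to a tree-based model. It is a first-pass filter, not a final verdict.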
Step 3: Predictive Modeling
- I started with “generalized linear models”, achieving a decent conversion rate of 75%.
- This helped in identifying the relative importance of the variables in the prediction process.
- I then applied the insights gained to advanced modeling techniques like “random forest” and “boosted trees”.
- With a fair amount of parameter tuning and ensembling, I reached a conversion rate of ~85%.
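To make the first and last bullets concrete, here is a minimal Python sketch (the original models were built in R). It fits a logistic regression, one common generalized linear model for a 0/1 target, by plain gradient descent, ranks features by absolute coefficient as a rough importance measure, and blends model scores by simple averaging. The post does not say which GLM family or ensembling scheme was used, so these are assumptions, as are the learning rate, the epoch count, the feature names and the toy data. The random forest and boosted tree models are not reproduced here.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def predict(weights, bias, row):
    """Predicted probability of the positive class for one row."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


def fit_logistic(rows, labels, learning_rate=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent on the log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(rows, labels):
            error = predict(weights, bias, row) - label
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def rank_by_importance(names, weights):
    """Order features by |coefficient|; only meaningful on comparably scaled inputs."""
    return sorted(zip(names, weights), key=lambda pair: abs(pair[1]), reverse=True)


def average_scores(*score_lists):
    """Blend several models' scores for the same rows by simple averaging."""
    return [sum(scores) / len(scores) for scores in zip(*score_lists)]


# Toy, already-scaled data invented for illustration: the first feature
# determines the label, the second is unrelated to it.
names = ["informative", "noise"]
rows = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]

weights, bias = fit_logistic(rows, labels)
ranked = rank_by_importance(names, weights)
```

Coefficient size is only a fair importance proxy when the inputs share a scale, which is one reason the normalization in Step 2 comes before modeling.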
Conclusion:
This model can be used to predict each customer's propensity to opt for a specific product or service. The marketing efforts can then target only the “top” 10-15% of customers, ranked by their propensities.
Thus a business can simultaneously lower its marketing spend and increase its lead conversion rate, leading to a higher ROI on marketing spend.
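Picking the “top” slice from model scores is a simple ranking step. A minimal Python sketch follows, assuming the scores are held in a dictionary keyed by customer ID. The function name, the IDs and the scores are invented for illustration.

```python
from math import ceil


def top_fraction(scores_by_id, fraction):
    """Return the IDs of the highest-scoring share of customers, best first."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores_by_id.items(), key=lambda item: item[1], reverse=True)
    keep = ceil(len(ranked) * fraction)  # round up so at least one lead is kept
    return [customer_id for customer_id, _ in ranked[:keep]]


# Toy propensity scores, invented for illustration only.
scores = {"ID_A": 0.9, "ID_B": 0.1, "ID_C": 0.5, "ID_D": 0.7}
targets = top_fraction(scores, fraction=0.5)
```

In practice the fraction would be set from the marketing budget, for example 0.10 to 0.15 for the 10-15% mentioned above.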