Welcome to part 2 of Loan Prediction Analysis. In this section I will cover how to
use the understanding of the data we built in the first part to build an effective machine learning model. By the end of this part you will have an idea of the process of model building,
data preprocessing and feature selection. So without wasting any further time, let's get started.
df.isnull().sum()
#---------------Null_value_imputations---------------------
Here we will use simple imputation based on the distribution of each column.
#----LoanAmount------
plt.hist(df['LoanAmount'])
The distribution of LoanAmount is skewed, so filling the null values with the mean would not be appropriate: the mean is pulled toward the long tail, and imputing it would distort the distribution further. The median is a safer choice.
df['LoanAmount'] = df["LoanAmount"].fillna(df["LoanAmount"].median())
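To see why the median is the safer fill value for a skewed column, here is a minimal sketch on made-up numbers. The `loan_amount` series is hypothetical and is not taken from the loan dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed loan amounts with two missing entries.
loan_amount = pd.Series([100.0, 110.0, 120.0, 130.0, 700.0, np.nan, np.nan])

mean_value = loan_amount.mean()      # pulled upward by the single large loan
median_value = loan_amount.median()  # stays close to the bulk of the values

print("mean:", mean_value, "median:", median_value)

# Fill the gaps with the median, as done for LoanAmount above.
filled = loan_amount.fillna(median_value)
print(filled.tolist())
```

Here one large loan is enough to move the mean well above most of the observed values, while the median still looks like a typical loan.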
plt.hist(df['Loan_Amount_Term'])
Loan_Amount_Term takes only a few distinct values, so we treat it as a categorical variable. To fill the null values of a categorical variable we use its most frequent value.
df['Loan_Amount_Term'].value_counts()
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(value = 360.0)
df['Credit_History'] = df['Credit_History'].fillna(value = 1.0)
df['Gender'] = df['Gender'].fillna(value = 'Male')
df['Self_Employed'] = df['Self_Employed'].fillna(value = "No")
df['Dependents'] = df['Dependents'].fillna(value = 0.0)
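The fill values above are typed in by hand after reading `value_counts()`. As an alternative, the most frequent value can be looked up in code. This is only a sketch: the `toy` frame below is invented, and only its column names come from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loan data with one gap per column.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", np.nan, "Male"],
    "Self_Employed": ["No", "No", np.nan, "Yes", "No"],
    "Credit_History": [1.0, 1.0, 0.0, np.nan, 1.0],
})

for column in ["Gender", "Self_Employed", "Credit_History"]:
    # mode() ignores missing values and returns the most frequent value first
    most_frequent = toy[column].mode().iloc[0]
    toy[column] = toy[column].fillna(most_frequent)

print(toy)
```

This keeps the imputation correct if the data changes, at the cost of hiding the chosen value from the reader, so printing `most_frequent` for each column may be worth it.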
Now all the null values in the dataset are filled with appropriate values based on the distribution of each column. There are many other methods one can try for filling missing values. The methods discussed in this article are very basic and simplistic; one can try different methods depending on the demands of the specific problem. If one has a huge dataset and deleting some rows does not affect the scope of the problem, then deleting the rows with missing values is often the simplest option.
We will discuss three feature selection methods, each belonging to a different category: 1) Filter methods 2) Wrapper methods 3) Embedded methods.
#----Filter_methods--------
The filter methods include the correlation filter method. In this method we filter out those features which have low correlation with the target variable or high correlation with other features.
def mapping_function(series):
    # Map the 'Y'/'N' loan status labels to 1/0
    final_list = []
    for value in series:
        if value == 'Y':
            final_list.append(1)
        else:
            final_list.append(0)
    return final_list
df['Loan_Status'] = mapping_function(df['Loan_Status'])
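The same mapping can also be written without an explicit loop. A minimal sketch, assuming the labels are exactly 'Y' and 'N'; the `status` series and its index are hypothetical.

```python
import pandas as pd

# Hypothetical target column with the same 'Y'/'N' labels as Loan_Status.
status = pd.Series(["Y", "N", "Y", "Y"], index=[10, 11, 12, 13])

# Vectorised equivalent of the loop: 'Y' becomes 1, everything else becomes 0.
encoded = (status == "Y").astype(int)

print(encoded.tolist())
```

The comparison returns a Series with the original index, so the result stays aligned with the rest of the DataFrame even when the index is not 0, 1, 2, and so on.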
df_corr = df.corr(numeric_only=True)  # numeric columns only; pandas >= 2.0 raises an error on text columns otherwise
plt.figure(figsize=(10,8))
sns.heatmap(df_corr,annot=True)
From the above heatmap we get only one feature with a high positive correlation with the target variable, but many features with low correlation with the other variables in the dataset. The main reason for looking for features that have low correlation with other features is that we want to reduce redundancy in the dataset. If we keep features that are correlated with each other, which means they carry almost the same information, then we end up in a situation where we cannot determine which feature is contributing to the decision process.
# The features selected by the correlation filter method are the following
corr_selected_feature = ['Credit_History','Dependents','Loan_Amount_Term','ApplicantIncome', 'CoapplicantIncome']
Here Credit_History looks like a very good feature, as it has a negative correlation with almost all the other features and a high correlation with the target variable. On the other hand, LoanAmount has a comparatively higher correlation with the other features in the dataset. The remaining features in the list were selected because they have low correlation with each other.
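The list above was picked by reading the heatmap. The same two rules can be applied in code, as in the following sketch. Everything in it is an assumption made for illustration: the data is synthetic, and the column names and both thresholds (0.2 and 0.9) are arbitrary choices, not values from the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: one useful feature, a near-duplicate of it, and pure noise.
signal = rng.normal(size=n)
data = pd.DataFrame({
    "signal": signal,
    "signal_copy": signal + rng.normal(scale=0.05, size=n),
    "noise": rng.normal(size=n),
})
target = pd.Series((signal + rng.normal(scale=0.5, size=n) > 0).astype(int))

MIN_TARGET_CORR = 0.2   # drop features that barely relate to the target
MAX_FEATURE_CORR = 0.9  # drop features that duplicate an already kept one

# Rule 1: rank features by absolute correlation with the target.
target_corr = data.corrwith(target).abs().sort_values(ascending=False)

# Rule 2: walk down the ranking and skip features that are strongly
# correlated with a feature that was already selected.
feature_corr = data.corr().abs()
selected = []
for feature in target_corr.index:
    if target_corr[feature] < MIN_TARGET_CORR:
        continue
    if all(feature_corr.loc[feature, kept] < MAX_FEATURE_CORR for kept in selected):
        selected.append(feature)

print(target_corr)
print(selected)
```

Only one of the two near-identical columns survives, which is the redundancy reduction described above. Correlation only captures linear relationships, so a feature rejected here can still be useful to a non-linear model.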
df.head()
#------------Convert_data_to_numeric_format-------------------------
categorical_features = []
for column in df.columns:
    if df[column].dtype == "object":
        categorical_features.append(column)
To implement the other feature selection methods we will need a machine learning model. In a later section we will cover wrapper methods and embedded methods for feature selection. To see the impact of feature selection, we first build a model with all the features and then move on to a model with the selected features. Before that, we need to convert all the categorical columns to numeric form, because the models only accept numeric input.
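As a small preview of why these two families need a model, here is a sketch using scikit-learn's `RFE` (a wrapper method) and `SelectFromModel` (an embedded method). The data is generated with `make_classification` and merely stands in for the loan features; the choice of estimators and of keeping 3 features is an assumption for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the numeric loan features: 8 columns, 3 informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=42)

# Wrapper method: recursive feature elimination refits the model repeatedly
# and drops the weakest feature each round until 3 remain.
wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
wrapper.fit(X, y)
print("RFE keeps columns:", wrapper.get_support(indices=True))

# Embedded method: the random forest computes feature importances while it
# trains; SelectFromModel keeps features with above-average importance.
embedded = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))
embedded.fit(X, y)
print("Forest keeps columns:", embedded.get_support(indices=True))
```

Both selectors expect purely numeric input, which is why the categorical columns are encoded next.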
#---------------One_hot_encoding_method----------------------------
for column in categorical_features:
    # prefix keeps dummy names unique when two columns share labels such as Yes/No
    temp = pd.get_dummies(df[column], prefix=column)
    df = pd.concat([df, temp], axis=1)
    df = df.drop([column], axis=1)
df.head()
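One detail of one-hot encoding is worth checking on a small example: when two categorical columns share the same labels, unprefixed dummy columns end up with identical names. The `toy` frame below is hypothetical; only the column names are borrowed from the loan data.

```python
import pandas as pd

# Hypothetical rows; 'Married' and 'Self_Employed' share the labels Yes/No.
toy = pd.DataFrame({
    "Married": ["Yes", "No", "Yes"],
    "Self_Employed": ["No", "No", "Yes"],
    "ApplicantIncome": [5000, 3000, 4000],
})

# Without a prefix, both columns produce dummies called 'No' and 'Yes'.
unprefixed = pd.concat(
    [pd.get_dummies(toy["Married"]), pd.get_dummies(toy["Self_Employed"])], axis=1
)
print(unprefixed.columns.tolist())

# Passing the columns to get_dummies on the whole frame prefixes each dummy
# with the name of its source column, so the names stay unique.
encoded = pd.get_dummies(toy, columns=["Married", "Self_Employed"])
print(encoded.columns.tolist())
```

Unique names matter later, because selecting a feature by name from a frame with duplicate column names returns several columns at once.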
target = df["Loan_Status"]
df = df.drop('Loan_Status',axis=1)
#---------------Split_dataset_into_train_validation_set----------------
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df,target,test_size = 0.20,random_state=42)
#------------------import_model_packages_from_sklearn------------------------
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
#---------------Build_model_with_all_the_features----------------------------
#--------------Implement_KNN_model--------------------
all_feature_model_knn = KNeighborsClassifier()
print(all_feature_model_knn)
all_feature_model_knn.fit(X_train,y_train)
print("\n")
print("Average Cross_validation Score for KNN is {} ".format(np.mean(cross_val_score(all_feature_model_knn,X_train,y_train,cv=5))))
#--------------Implement_decision_tree_model--------------------
all_feature_model_DC = DecisionTreeClassifier()
print(all_feature_model_DC)
all_feature_model_DC.fit(X_train,y_train)
print("\n")
print("Average Cross_validation Score for decisiontree is {} ".format(np.mean(cross_val_score(all_feature_model_DC,X_train,y_train,cv=5))))
#-------------Implement_random_forest_model--------------------
all_feature_model_RF = RandomForestClassifier()
print(all_feature_model_RF)
all_feature_model_RF.fit(X_train,y_train)
print("\n")
print("Average Cross_validation Score for randomforest is {} ".format(np.mean(cross_val_score(all_feature_model_RF,X_train,y_train,cv=5))))
#--------------Implement_logistic_regression_model--------------------
all_feature_model_LR = LogisticRegression()
print(all_feature_model_LR)
all_feature_model_LR.fit(X_train,y_train)
print("\n")
print("Average Cross_validation Score for logistic_regression is {} ".format(np.mean(cross_val_score(all_feature_model_LR,X_train,y_train,cv=5))))
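The four blocks above repeat the same steps, so they can also be written as one loop over a dictionary of models. This is a sketch on synthetic data: `X_demo` and `y_demo` are stand-ins for `X_train` and `y_train`, and the `random_state` and `max_iter` settings are added only to make the example repeatable and quiet.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (not the loan data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

models = {
    "KNN": KNeighborsClassifier(),
    "decisiontree": DecisionTreeClassifier(random_state=42),
    "randomforest": RandomForestClassifier(random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

cv_scores = {}
for name, model in models.items():
    # cross_val_score clones and refits the model on each of the 5 folds,
    # so these scores do not depend on a separate fit call.
    cv_scores[name] = np.mean(cross_val_score(model, X_demo, y_demo, cv=5))
    print("Average cross-validation score for {} is {:.3f}".format(name, cv_scores[name]))
```

A separate `fit` on the full training set is still needed before calling `predict` on the test set, because `cross_val_score` leaves the original estimator unfitted.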
#---------Check_model_performance------------------------
from sklearn.metrics import classification_report
print("The performance_of_KNN is. {}".format(classification_report(y_test,all_feature_model_knn.predict(X_test))))
print("The performance_of_DC is. {}".format(classification_report(y_test,all_feature_model_DC.predict(X_test))))
print("The performance_of_RF is. {}".format(classification_report(y_test,all_feature_model_RF.predict(X_test))))
print("The performance_of_LR is. {}".format(classification_report(y_test,all_feature_model_LR.predict(X_test))))
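With the all-feature baseline in place, the comparison against a selected-feature model can follow the same pattern. The sketch below shows the mechanics only. It uses synthetic data with hypothetical column names, and `selected_features` is chosen by construction rather than by any of the selection methods, so the scores say nothing about the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: with shuffle=False the 2 informative columns come first
# and the remaining 4 columns are random noise.
X_arr, y_demo = make_classification(n_samples=300, n_features=6, n_informative=2,
                                    n_redundant=0, shuffle=False, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=["feature_{}".format(i) for i in range(6)])

# Hypothetical output of a feature selection step.
selected_features = ["feature_0", "feature_1"]

model = RandomForestClassifier(random_state=42)
score_all = np.mean(cross_val_score(model, X_demo, y_demo, cv=5))
score_selected = np.mean(cross_val_score(model, X_demo[selected_features], y_demo, cv=5))

print("All features:      {:.3f}".format(score_all))
print("Selected features: {:.3f}".format(score_selected))
```

On the loan data the same comparison would use `X_train[corr_selected_feature]` in place of `X_demo[selected_features]`, and the selected model can score higher or lower than the baseline.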