In this article I will walk you through exploratory data analysis (EDA) for machine learning modeling. It covers the key methods for analyzing both the numerical and the categorical features of a dataset.
For this walkthrough I will be using the loan prediction data from Analytics Vidhya.
Data Link - https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/#ProblemStatement
Import the necessary Python packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Read the dataset into a pandas DataFrame
df = pd.read_csv('/content/train_ctrUa4K.csv')
df.head()
The pandas read_csv() function reads data from a CSV file and stores it in a pandas DataFrame (Excel files are read with read_excel() instead). The DataFrame is the most common and useful data structure in pandas. The head() call that follows read_csv() shows the first five rows of the DataFrame.
df.info()
The info() function summarizes the data: the data type of each feature, the number of non-null values, the feature names (listed under Column on the left), the dimensions of the data, and so on. This basic information tells us which features need our attention.
df.describe()
The describe() function shows the distribution of the numerical features. The mean, standard deviation, and quartiles help determine the nature of the data, and comparing the quartiles with the minimum and maximum gives a first indication of skewness and outliers.
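The quartile study mentioned above can be sketched with the common 1.5 x IQR rule. The income values below are made up for illustration (they are not taken from the loan dataset), and the 1.5 multiplier is only a convention, so treat this as a rough check rather than a fixed recipe.

```python
import pandas as pd

# Hypothetical income values, NOT taken from the loan dataset
income = pd.Series([2500, 3000, 3200, 3500, 4000, 4200, 4500, 5000, 81000])

# Quartiles, the same ones describe() reports as 25% and 75%
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences; values outside them are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = income[(income < lower_fence) | (income > upper_fence)]

print(outliers.tolist())
```

A large gap between the 75% quartile and the maximum, as in this toy series, is the kind of pattern to look for in the describe() output.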
df.isnull().sum()
The code above counts the null values in each feature of the dataset. This tells us which features need attention during preprocessing.
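Raw counts are easier to judge as percentages. Here is a minimal sketch, assuming a small made-up frame (the values are hypothetical, only the column names mirror the dataset), that ranks features by their share of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries, for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, np.nan],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# isnull().mean() is the fraction of missing values per column;
# multiplying by 100 turns it into a percentage
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)

print(missing_pct)
```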
I will start the exploratory analysis by looking at the distribution of "Loan_Status".
This shows whether the number of examples in each class is well balanced or skewed.
If the examples our model learns from are skewed, we must check the degree of skewness, as a heavy imbalance can bias the model towards the majority class.
c1 = df[df['Loan_Status']=="Y"].shape[0]/df['Loan_Status'].shape[0]
c2 = df[df['Loan_Status']=="N"].shape[0]/df['Loan_Status'].shape[0]
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
labels = ['Loan_granted%', 'Loan_rejected%']
loan_grant_dist = [c1,c2]
ax.bar(labels,loan_grant_dist,color = ['skyblue','orange'])
plt.show()
From the graph above we can infer that there is a significant difference between the number of examples in the two classes. There is no need to worry, though, as the minority class does not fall below 5% of the examples.
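The same class shares can be computed more compactly with value_counts(normalize=True). The sketch below uses a made-up Loan_Status series, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical target column, NOT the real dataset
status = pd.Series(
    ["Y", "Y", "N", "Y", "N", "Y", "Y", "N", "Y", "Y"], name="Loan_Status"
)

# normalize=True returns the fraction of rows in each class
class_share = status.value_counts(normalize=True)

# Share of the smaller class, the number to compare with a chosen threshold
minority_share = class_share.min()

print(class_share)
print(minority_share)
```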
sns.countplot(x=df['Gender'])
The graph shows that the number of male applicants is significantly higher than the number of female applicants. One reason could be the nature of our society, where in most cases men are the ones involved in financial decisions, which would explain the large number of male loan applications.
#--------Loan_granted_and_not_granted_with_respect_to_gender----
Loan_granted = df[df['Loan_Status']== 'Y']
t1= Loan_granted[Loan_granted['Gender']=='Male'].shape[0]/sum(Loan_granted['Gender'].value_counts())
t2 = Loan_granted[Loan_granted['Gender']=='Female'].shape[0]/sum(Loan_granted['Gender'].value_counts())
After looking at the distribution of loan status by gender, we see that the share of granted loans going to male applicants is much higher than the share going to female applicants. This suggests that the data has some bias towards one gender category, so we should consider Gender among the features selected for model building. This bias could simply be due to the higher number of male applicants in the data, so let's check.
#-----------Loan_granted_%_within_female_and_within_male----------
female_population = df[df['Gender']=='Female']
male_population = df[df['Gender']=='Male']
loan_grant_per_female = female_population[female_population['Loan_Status']== 'Y'].shape[0]/sum(female_population['Gender'].value_counts())
loan_grant_per_male = male_population[male_population['Loan_Status']== 'Y'].shape[0]/sum(male_population['Gender'].value_counts())
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
labels = ['Loan_granted_male%', 'Loan_granted_female%']
loan_grant_dist = [loan_grant_per_male,loan_grant_per_female]
ax.bar(labels,loan_grant_dist,color = ['skyblue','orange'])
plt.show()
This graph makes things much clearer: the loan grant rate among female applicants is 66%, while among male applicants it is 69%. The huge gap between male and female in the earlier distribution was due to population bias. Still, male applicants have a slightly higher chance of getting a loan.
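The two views used above (share of granted loans by gender versus grant rate within each gender) can be produced side by side with pd.crosstab. This is a minimal sketch on made-up rows, so the rates it prints are not the dataset's.

```python
import pandas as pd

# Hypothetical applicants: 6 male, 2 female (illustration only)
toy = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 2,
    "Loan_Status": ["Y", "Y", "Y", "Y", "N", "N", "Y", "N"],
})

# normalize="index": each row sums to 1, i.e. the grant rate within each gender
within_gender = pd.crosstab(toy["Gender"], toy["Loan_Status"], normalize="index")

# normalize="columns": each column sums to 1, i.e. who makes up each outcome;
# this view is the one affected by the larger number of male applicants
within_status = pd.crosstab(toy["Gender"], toy["Loan_Status"], normalize="columns")

print(within_gender)
print(within_status)
```

Reading the row-normalized table is what removes the population bias, since each gender is compared against its own total.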
Let's look at the property area feature.
If we look at the percentages, we can infer that the probability of getting a loan increases as we move from Rural to Urban areas, and that it is highest for Semiurban areas. One reason could be that semiurban areas are expected to have more growth potential than Rural and Urban areas.
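The percentages discussed above could be computed along these lines. This is a sketch under assumptions: the column name Property_Area and the rows below are made up for illustration, so only the approach carries over to the real data.

```python
import pandas as pd

# Hypothetical rows; the column name Property_Area is an assumption
toy = pd.DataFrame({
    "Property_Area": ["Rural", "Rural", "Urban", "Urban",
                      "Semiurban", "Semiurban", "Semiurban", "Rural"],
    "Loan_Status": ["Y", "N", "Y", "N", "Y", "Y", "N", "N"],
})

# eq("Y") gives True/False per row; the mean of booleans within each
# area is the fraction of granted loans in that area
approval_rate = (
    toy["Loan_Status"].eq("Y")
    .groupby(toy["Property_Area"])
    .mean()
    .sort_values(ascending=False)
)

print(approval_rate)
```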
In the Dependents distribution there is no notable difference between the examples belonging to the Yes and No class labels.
loan_granted = df[df['Loan_Status']=='Y']
loan_not_granted = df[df['Loan_Status'] == 'N']
# Share of each Dependents category among granted loans
granted_share = loan_granted['Dependents'].value_counts()/loan_granted['Dependents'].shape[0]
sns.barplot(x=list(granted_share.index), y=list(granted_share))
plt.show()
# Share of each Dependents category among loans that were not granted
not_granted_share = loan_not_granted['Dependents'].value_counts()/loan_not_granted['Dependents'].shape[0]
sns.barplot(x=list(not_granted_share.index), y=list(not_granted_share))
plt.show()
Look into the Married feature
temp1 = Loan_granted[Loan_granted['Married'] == 'Yes'].shape[0]/sum(Loan_granted['Married'].value_counts())
temp2 = Loan_granted[Loan_granted['Married'] == 'No'].shape[0]/sum(Loan_granted['Married'].value_counts())
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
list1 = ['Married%', 'Unmarried%']
loan_grant_dist = [temp1,temp2]
ax.bar(list1,loan_grant_dist,color = ['skyblue','orange'])
plt.show()
The data suggests that married applicants are more likely to get a loan. The distribution graph shows that, out of all loan_granted entries, more than 65% of applicants are married while only 32% are unmarried. One reason could be that married people have less appetite for risk than unmarried people and are therefore considered safer borrowers.
Look into the Education feature
temp1 = Loan_granted[Loan_granted['Education'] == 'Graduate'].shape[0]/sum(Loan_granted['Education'].value_counts())
temp2 = Loan_granted[Loan_granted['Education'] == 'Not Graduate'].shape[0]/sum(Loan_granted['Education'].value_counts())
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
list1 = ['Graduate%', 'Not Graduate%']
loan_grant_dist = [temp1,temp2]
ax.bar(list1,loan_grant_dist,color = ['skyblue','orange'])
plt.show()
From here onwards all the features are numerical, so we will use correlation as a first check to see how the features are correlated with each other as well as with our target variable (a.k.a. the dependent variable).
If features are strongly correlated with each other, there is multicollinearity, which one should try to avoid. When features in a dataset are correlated with each other, it becomes very difficult to understand which feature is really contributing to the predicted output.
A heatmap is a type of plot that can be used to visualize correlation.
To plot it, first calculate the correlation using the following code:
temp = df.corr(numeric_only=True)  # numeric_only (pandas >= 1.5) skips non-numeric columns
The resulting dataframe holds the correlation between each pair of features. Now let's visualize it to get a better understanding.
plt.figure(figsize = (10,8))
sns.heatmap(temp,annot=True)
The inference from the graph above is that there are three pairs of features with a significant amount of correlation:
1) [LoanAmount - ApplicantIncome]
2) [LoanAmount - CoapplicantIncome]
3) [ApplicantIncome - Dependents ]
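Pairs like the ones listed above can also be pulled out of the correlation matrix programmatically instead of being read off the heatmap. The sketch below uses made-up numbers, and the 0.5 cut-off is an arbitrary assumption for illustration, not a value from this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features, for illustration only
toy = pd.DataFrame({
    "ApplicantIncome": [2000, 3000, 4000, 5000, 6000, 7000],
    "LoanAmount": [60, 95, 120, 160, 170, 220],
    "Loan_Amount_Term": [360, 180, 360, 120, 360, 180],
})

corr = toy.corr()

# Keep only the upper triangle (above the diagonal) so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Arbitrary threshold on the absolute correlation
strong_pairs = pairs[pairs.abs() > 0.5].sort_values(ascending=False)

print(strong_pairs)
```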
From the graph above we can infer the following:
There is a positive correlation between LoanAmount and ApplicantIncome, which shows that people with higher salaries ask for larger loan amounts, as their higher income lets them pay the loan back. Another reason is that a higher-income person has a bigger spending budget than a lower-income person.
LoanAmount is also positively correlated with CoapplicantIncome, which means that the higher the coapplicant's income, the larger the loan amount requested.
To justify this, consider the example of a working husband and wife. When the earnings of both increase, their lifestyle, spending, and financial investment budget also increase.
The third point shows a positive correlation between ApplicantIncome and Dependents. This does not make much sense, because Dependents is categorical while ApplicantIncome is numerical. Dependents looks like it holds numerical values, but in reality it is not a continuous variable.
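Since Dependents is better treated as a category, one alternative to a correlation coefficient is to compare income across its levels. This is a sketch on made-up rows; the "3+" label is an assumption about how the category is encoded.

```python
import pandas as pd

# Hypothetical rows; the "3+" level is an assumed encoding
toy = pd.DataFrame({
    "Dependents": ["0", "0", "1", "1", "2", "3+", "3+"],
    "ApplicantIncome": [3000, 5000, 4000, 4400, 3500, 6000, 9000],
})

# Median income per Dependents level; the median is less sensitive
# to outliers than the mean
median_income = toy.groupby("Dependents")["ApplicantIncome"].median()

print(median_income)
```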
Univariate Numerical Analysis
Now let's look into univariate analysis of some numerical variables in the dataset.
Let's first look into Applicant income and CoapplicantIncome.
fig,(ax1,ax2) = plt.subplots(1,2,figsize = (15,5))
index = np.arange(df.shape[0])  # row position used as the x axis
sns.regplot(x=index,y=df['ApplicantIncome'],ax=ax1)
sns.regplot(x=index,y=df['CoapplicantIncome'],ax=ax2)
fig,(ax1,ax2) = plt.subplots(1,2,figsize = (15,5))
sns.boxplot(x=df['ApplicantIncome'],ax=ax1)
sns.boxplot(x=df['CoapplicantIncome'],ax=ax2)
Looking at the plots above, we can say that there are outliers in both features.
If we take a close look at coapplicant income, we see an irregularity in the data: it contains many zero entries. These zeros pull the distribution towards lower values.
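The effect of the zero entries can be quantified with a quick check like the one below. The values are made up for illustration, so the resulting share is not the dataset's.

```python
import pandas as pd

# Hypothetical coapplicant incomes, NOT taken from the dataset
coapplicant = pd.Series([0, 1500, 0, 2400, 0, 0, 4200, 1800])

# Fraction of rows where the coapplicant income is exactly zero
zero_share = (coapplicant == 0).mean()

# Median with and without the zero entries, to see how far they pull it down
median_all = coapplicant.median()
median_nonzero = coapplicant[coapplicant > 0].median()

print(zero_share, median_all, median_nonzero)
```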
Now let's look at the distribution of these features to check for skewness.
Let's plot histograms for the applicant and coapplicant income features.
sns.displot(df['ApplicantIncome'])
The graph above shows that the applicant income feature is not normally distributed, whereas roughly normal data is often preferred in linear ML modeling.
A characteristic of well-distributed data is that its mean and median are close to each other. If the mean and median coincide, the distribution is roughly symmetric, which is convenient for building a model.
Unfortunately, this normality condition does not hold here. Before using this data we will need to do some preprocessing to make it useful for modeling. The data above is called right skewed, as it has a long tail on the right side.
As a rule of thumb, if Mean > Median the data is called right skewed; conversely, if Mean < Median it is called left skewed.
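The mean versus median comparison above is easy to check numerically, and pandas also offers skew() as a direct measure. The income values here are made up for illustration.

```python
import pandas as pd

# Hypothetical income values with one very large entry
income = pd.Series([2500, 3000, 3200, 3500, 4000, 4200, 4500, 5000, 81000])

mean_val = income.mean()
median_val = income.median()

# Rule of thumb from the text: mean above median suggests right skew
is_right_skewed = mean_val > median_val

# Sample skewness; a positive value also points to a long right tail
skewness = income.skew()

print(mean_val, median_val, is_right_skewed, skewness)
```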
sns.displot(df['CoapplicantIncome'])
The plot above shows the coapplicant income. Its pattern and distribution have a lot in common with applicant income. The distribution of coapplicant income is also right skewed, which means Mean > Median; the long right tail confirms the same.
The many zero values in coapplicant income pull the distribution towards the lower side. We cannot feed this data directly into the model; it requires a lot of preprocessing first.
Moving from categorical to numerical variables, we explored different aspects of our dataset. At this point we have accomplished our goal of understanding the dataset.
The exploratory analysis gave us useful information about the amount of preprocessing the raw dataset requires. We will use this information to preprocess our data and make it consumable in part 2.