bhavesh wadibhasme

Pima Indians data analysis

Updated: Apr 11, 2021



#------------import_packages-------------
import pandas as pd
import numpy as np
from matplotlib import pyplot
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_recall_curve, auc, f1_score

# read the file directly from the URL
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(url, names=names)

Read the dataset directly from a URL. This is one of the ways to read data with pandas; it is robust and convenient because it does not require storing the file on a local hard drive first.

df.head()


Pandas' head() method prints the first five rows of the data frame. This gives a quick overview of the dataset and yields some useful first insights, such as the scale of each feature, the nature of the target class, and the number of features along with their names.
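The feature count and names mentioned above can also be read off directly (a small sketch):

# dataset dimensions and the list of feature names
print(df.shape)
print(df.columns.tolist())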

#-------------Information_of_data-----------------------
df.info()


The info() function in pandas summarizes basic information about the dataset. From its output we can see the type of each feature, for example integer, float, or object. This is very useful information for any data scientist, since the preprocessing strategy depends on it. The function also reports the number of non-null entries per column, which reveals any null values; handling them is one of the core parts of data preprocessing.

#--------Summary_of_data---------------------------
df.describe()


The describe() function generates a statistical summary of the dataset. Looking at this summary for the Pima Indians dataset, we can form a number of assumptions about the data; testing them helps when drawing charts and understanding the dataset's behaviour. The count row helps to detect missing values: here, every count equals the number of examples in the dataset, so no values are missing. The mean describes each feature's central tendency, while the standard deviation describes its spread.
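The claim that every count equals the number of examples can be verified in one line (a small check):

# True when describe()'s count row matches the number of rows for every feature
print((df.describe().loc['count'] == len(df)).all())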

#-------------------missing_values_analysis----------------------
df.isnull().sum()


Check the number of null values in each column of the dataset.

#---------------------Number_of_numeric_variables_and_categorical_variables---------------
numeric_features = df.select_dtypes(include=['float64'])
int_features = df.select_dtypes(include=['int64'])
categorical_features = df.select_dtypes(include=['object'])
boolean_features = df.select_dtypes(include=['bool'])

Extract subsets of features by data type from the dataset. This piece of code helps to decide which preprocessing method each kind of feature requires. Luckily, our data carries all of its features as numeric values and has no nulls, so apart from normalization we do not require any other preprocessing here.
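As a quick sanity check, we can count the columns in each subset created above:

# number of columns of each data type
print('float      :', len(numeric_features.columns))
print('integer    :', len(int_features.columns))
print('categorical:', len(categorical_features.columns))
print('boolean    :', len(boolean_features.columns))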

df.head()

target_ = df['class']
df_ = df.drop('class', axis=1)

X_train, X_test, y_train, y_test = train_test_split(df_, target_, test_size=0.20, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Split the data into train and test sets. The train data is used to build the model, while the test data is used to validate its performance. This is standard practice in machine learning model building.
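One thing to note before running the modelling code: the model_evaluation() helper it calls is never defined in the post. A minimal sketch, assuming it simply plots the precision-recall curve with the pyplot module imported above:

def model_evaluation(recall, precision):
    # plot the precision-recall curve of the fitted classifier
    pyplot.plot(recall, precision, marker='.', label='Logistic')
    pyplot.xlabel('Recall')
    pyplot.ylabel('Precision')
    pyplot.legend()
    pyplot.show()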

#--------------------Model_on_raw_data--------------------
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=42, shuffle=True)
scores = cross_val_score(model, df_, target_, scoring='accuracy', cv=cv, n_jobs=-1)
print(scores)
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
# evaluate on the held-out test set
lr_pred = model.predict(X_test)
print(f1_score(y_test, lr_pred))
# keep the probabilities for the positive class only
lr_probs = model.predict_proba(X_test)[:, 1]
lr_precision, lr_recall, _ = precision_recall_curve(y_test, lr_probs)
lr_f1, lr_auc = f1_score(y_test, lr_pred), auc(lr_recall, lr_precision)
# summarize scores
print('Logistic: f1=%.3f auc=%.3f' % (lr_f1, lr_auc))
model_evaluation(lr_recall, lr_precision)
#--------------------Normalization--------------------
df = df.apply(lambda x: x / x.max(), axis=0)
df.head()

Normalize the dataset so that all features share a common scale. As we saw, the raw data carries a different range of values in each feature. In such cases the ML model may become biased towards features with larger value ranges during the decision process, especially with parametric models such as linear regression, logistic regression, support vector machines, and neural networks.
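For reference, scikit-learn offers MinMaxScaler for this job. Note it is not identical to the division by the column maximum used above, because it also subtracts each column's minimum. A minimal sketch (df_scaled is an illustrative name, not from the post):

from sklearn.preprocessing import MinMaxScaler

# scale every feature to [0, 1]: (x - min) / (max - min)
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_), columns=df_.columns)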

#----------------Model_on_normalized_data----------------
target = df['class']
df = df.drop('class', axis=1)
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.20, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=42, shuffle=True)
scores = cross_val_score(model, df, target, scoring='accuracy', cv=cv, n_jobs=-1)
print(scores)
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
# evaluate on the held-out test set
lr_pred = model.predict(X_test)
print(f1_score(y_test, lr_pred))
# keep the probabilities for the positive class only
lr_probs = model.predict_proba(X_test)[:, 1]
lr_precision, lr_recall, _ = precision_recall_curve(y_test, lr_probs)
lr_f1, lr_auc = f1_score(y_test, lr_pred), auc(lr_recall, lr_precision)
# summarize scores
print('Logistic: f1=%.3f auc=%.3f' % (lr_f1, lr_auc))
model_evaluation(lr_recall, lr_precision)

Comparing the two models above, one trained without feature normalization/scaling and one with it, we can clearly see the difference in performance: the model built with feature normalization/scaling performs better than the model built on the raw dataset.

The distribution of the dataset is not perfectly balanced across the two classes: class 1 has only about half as many examples as class 0. In such cases, evaluating the model with precision-recall curves is the best strategy.
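This imbalance is easy to confirm from the label counts (a quick check using the target extracted earlier):

# examples per class; class 1 has roughly half as many examples as class 0
print(target_.value_counts())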

