Finding Best Machine Learning Model

Machine Learning

Machine learning is a branch of artificial intelligence (AI) in which computers learn from data instead of being explicitly programmed for every case. It allows computers to handle new situations via analysis, self-training, observation and experience.

Supervised Machine Learning:

Supervised learning, as the name indicates, involves a supervisor acting as a teacher. We teach or train the machine using well-labeled data, meaning each example is already tagged with the correct answer.

Supervised learning algorithms fall into two categories:

  • Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” or “no disease”.

  • Regression: A regression problem is when the output variable is a numerical (continuous) value.

Regression vs Classification

The main difference between them is that the output variable in regression is numerical (or continuous) while that for classification is categorical (or discrete).

Note: We will be using supervised regression in this tutorial.

Unsupervised Machine Learning:

Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns and differences without any prior training on the data.

Coding Begins

Loading Dataset


First, import pandas to read the CSV file. I have taken the dataset from Kaggle. You can have a look at the data by clicking here. I have stored the attribute names, which we will be using later.

Suggestion: if you get a module error, run pip install module_name

This is how my data looks. It has some categorical values and some continuous values.
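A minimal sketch of the loading step. The in-memory CSV text and the column names age, children and charges are assumptions made for illustration (only bmi, sex, smoker and region are named in this tutorial); with the real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV file. With the real dataset,
# pass the file path to pd.read_csv instead of this in-memory buffer.
csv_text = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.92
18,male,,1,no,southeast,1725.55
28,male,33.0,3,no,southeast,4449.46
"""

data = pd.read_csv(io.StringIO(csv_text))

# Store the attribute (column) names for later steps.
attributes = list(data.columns)

print(data.head())
print(attributes)
```

The empty bmi field in the second row is read as NaN, which is the kind of missing value handled during preprocessing.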

Data Preprocessing

Data preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources, it is collected in a raw format that is not suitable for analysis.

There are different techniques used in data preprocessing. You need to find which one you have to apply for your dataset by analyzing it.

Well, there is one column (i.e. bmi) containing NaN values, which are basically missing data. We need to replace each NaN value with some meaningful value.


Imputation is the process of replacing missing data with substituted values.

There are different strategies for imputation, but we will stick to replacing missing values with the mean of the attribute.

First import Imputer from scikit-learn (recent scikit-learn versions provide SimpleImputer instead). You can see the NaN value is replaced with the mean (i.e. 30.666753).
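Here is a hedged sketch of mean imputation on the bmi column. It uses SimpleImputer from sklearn.impute, because the older Imputer class is no longer available in current scikit-learn releases; the four bmi values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up bmi values with one missing entry (an assumption for this sketch).
data = pd.DataFrame({"bmi": [27.9, np.nan, 33.0, 22.7]})

# Replace every NaN with the mean of the non-missing values in the column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
data[["bmi"]] = imputer.fit_transform(data[["bmi"]])

print(data)
```

Other strategies such as "median" or "most_frequent" can be passed to the same class if the mean is not a good fit for your data.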

Label Encoding

Label Encoding refers to converting labels into numeric form so that they become machine-readable. Machine learning algorithms can then decide in a better way how those labels should be operated on. It is an important pre-processing step for structured datasets in supervised learning.

The attributes sex, smoker and region hold categorical data. We need to encode this data as numerical values.

Why have I used Label Encoding?

Well, there are very few unique strings in all three attributes, so we can assign them labels directly, like male:0, female:1.
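The encoding step can be sketched as follows, assuming scikit-learn's LabelEncoder and a few made-up rows. Note that LabelEncoder numbers the classes in sorted order, so here female becomes 0 and male becomes 1, the reverse of the illustrative mapping above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for the three categorical attributes (an assumption).
data = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Fit a separate encoder per column; each maps its sorted unique strings
# to the integers 0, 1, 2, ...
for column in ["sex", "smoker", "region"]:
    data[column] = LabelEncoder().fit_transform(data[column])

print(data)
```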

Now all the attributes have numerical values.


In our data the attributes have very different ranges of values. In this case an attribute with very large values will have a higher impact on the predicted value than an attribute with smaller values, which is not acceptable. So we use normalization, in which we rescale all the attributes to the range 0 to 1.

Remember, we only need to normalize the independent variables. So I have split the data into the independent variables x and the dependent variable y.

The normalization task is done, but the column names of the original data have been lost. So we need to assign them back.

Now concatenate the independent and dependent variables.
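The normalization steps above can be sketched like this. MinMaxScaler is an assumption (the tutorial only says values are converted to the range 0 to 1), and the column name charges for the dependent variable and the sample values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up sample; "charges" as the dependent variable is an assumption.
data = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "bmi": [27.9, 33.77, 33.0, 22.705],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47],
})

# Split into independent variables x and dependent variable y.
x = data.drop(columns=["charges"])
y = data["charges"]

# fit_transform returns a plain NumPy array, so the column names are lost.
scaled = MinMaxScaler().fit_transform(x)

# Assign the original column names (and row index) back.
x = pd.DataFrame(scaled, columns=x.columns, index=x.index)

# Concatenate the independent and dependent variables again.
data = pd.concat([x, y], axis=1)

print(data)
```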

Now we have completed the data pre-processing part.

Splitting the Data

We have to split the data into training data and testing data, so that we can use the training data to train our model and the testing data to test it.
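A minimal sketch of the split, assuming scikit-learn's train_test_split with an 80/20 ratio. The ratio, the random_state value and the randomly generated data are assumptions, not the settings used for the results in this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data (an assumption; use your preprocessed x and y).
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.random((100, 3)), columns=["age", "bmi", "smoker"])
y = pd.Series(rng.random(100), name="charges")

# Hold out 20% of the rows for testing; random_state makes it repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)
```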

Training & Testing different ML models

We will use different regression models and find the best one by comparing their accuracy.

Linear Regression


First, we import LinearRegression from sklearn, train it on the training dataset, and find the accuracy of the model on the testing dataset. With linear regression our testing accuracy is 0.5962, which is not so good.


Regularization is a form of regression that constrains, regularizes, or shrinks the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
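The train-and-score pattern looks roughly like the sketch below. It runs on synthetic data from make_regression (an assumption), so the printed number will not match the 0.5962 reported above. For scikit-learn regressors, score returns the R² coefficient of determination, which is presumably the "accuracy" quoted in this tutorial.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; use the preprocessed dataset).
x, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Train on the training split only.
model = LinearRegression()
model.fit(x_train, y_train)

# For regressors, score() returns R^2 on the given data.
test_score = model.score(x_test, y_test)
print(f"Linear regression test score: {test_score:.4f}")
```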

Ridge Regularization(alpha=0.001)


First, we import Ridge from sklearn.linear_model, train it on the training dataset with alpha=0.0001, and find the accuracy of the model on the testing dataset. With Ridge regularization our testing accuracy is 0.5960.

Ridge Regression(alpha=0.1)


Again, we import Ridge from sklearn.linear_model, train it on the training dataset with alpha=0.1, and find the accuracy of the model on the testing dataset. With this stronger Ridge regularization our testing accuracy is 0.401.
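The two Ridge runs can be sketched in one loop, using the alpha values 0.0001 and 0.1 named in the text. The data is synthetic (an assumption), so the scores will differ from the ones reported above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; use the preprocessed dataset).
x, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# A larger alpha means a stronger penalty on the coefficient sizes.
scores = {}
for alpha in (0.0001, 0.1):
    model = Ridge(alpha=alpha)
    model.fit(x_train, y_train)
    scores[alpha] = model.score(x_test, y_test)  # R^2 on the test split
    print(f"Ridge(alpha={alpha}) test score: {scores[alpha]:.4f}")
```

Trying a handful of alpha values like this is a simple way to see how sensitive the model is to the regularization strength.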

Decision Tree


Well, the accuracy has increased.
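A sketch of this step, assuming scikit-learn's DecisionTreeRegressor with default settings and synthetic data. The original code and its settings are not shown here, so treat the details as illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; use the preprocessed dataset).
x, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# random_state fixes the tie-breaking between equally good splits.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(x_train, y_train)

tree_predictions = tree.predict(x_test)
tree_score = tree.score(x_test, y_test)  # R^2 on the test split
print(f"Decision tree test score: {tree_score:.4f}")
```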

Random Forest


The accuracy has increased significantly.
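A comparable sketch for the random forest, assuming RandomForestRegressor with 100 trees. The number of trees and the synthetic data are assumptions, so the score will not match the result reported in this tutorial.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; use the preprocessed dataset).
x, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# An ensemble of 100 decision trees, each trained on a bootstrap sample;
# the forest's prediction is the average of the trees' predictions.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(x_train, y_train)

forest_score = forest.score(x_test, y_test)  # R^2 on the test split
print(f"Random forest test score: {forest_score:.4f}")
```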

MLP Regressor


The accuracy is lower than that of the random forest.
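For completeness, here is a sketch assuming the model is scikit-learn's MLPRegressor (a multi-layer perceptron). The hidden layer size, max_iter value and synthetic data are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data (an assumption; use the preprocessed dataset).
x, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# One hidden layer of 100 units; a higher max_iter gives the optimizer
# more time to converge than the default.
mlp = MLPRegressor(hidden_layer_sizes=(100,), max_iter=2000, random_state=0)
mlp.fit(x_train, y_train)

mlp_score = mlp.score(x_test, y_test)  # R^2 on the test split
print(f"MLP regressor test score: {mlp_score:.4f}")
```

Neural networks are sensitive to feature scaling, which is one more reason the normalization step matters before training this model.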


After using multiple regression models and analyzing their accuracy, we found that Random Forest is the best model for our data, with an accuracy of 0.847.