Part 2 Machine Learning Simple Linear Regression

In the previous post we learned about data pre-processing. In this post we will review the simplest algorithm - Simple Linear Regression.

Business Problem: In this part of the series, we are going to learn about Simple Linear Regression using a data set of salaries and years of experience. As a data scientist, your job is to help the HR department predict the salary to offer a candidate based on his or her years of experience. If the offer is too low, the candidate will not accept it. If the offer is too high, you will be wasting the company’s resources ($$).

Getting the dataset

What is Simple Linear regression?

If we draw a graph of salary vs. experience, we will see a linear trend.
[Figure: salary vs. experience, showing a linear trend]

Based on the graph, it is clear that the trend has a positive slope. If we write the line as salary = b0 + b1 * experience, the coefficient b1 is the slope. If b1 is big, the line is steeper, which means that a small increase in experience gives a big increase in salary. If b1 is small, the line is gentler, and salary increases slowly as experience grows.
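To make the role of b1 concrete, here is a minimal sketch. It assumes the usual straight-line form salary = b0 + b1 * experience. The function name `predict_salary` and all the numbers are invented for illustration and are not taken from the dataset.

```python
def predict_salary(years, b0, b1):
    """Straight-line model: salary = b0 + b1 * years (illustrative only)."""
    return b0 + b1 * years

# Hypothetical coefficients: same starting salary, different slopes.
steep = {"b0": 30000, "b1": 10000}
gentle = {"b0": 30000, "b1": 2000}

# Effect of one extra year of experience (5 -> 6 years) under each slope.
steep_raise = predict_salary(6, **steep) - predict_salary(5, **steep)
gentle_raise = predict_salary(6, **gentle) - predict_salary(5, **gentle)

print(steep_raise)   # 10000
print(gentle_raise)  # 2000
```

In both cases the raise for one extra year is exactly b1, which is what "slope" means here.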

How is a trendline determined by a model?

The model finds the trendline by minimizing the sum of squared errors. In the diagram below, imagine the red cross as the observed value and the green cross as the value predicted by the model. The algorithm takes the difference between the two and squares it. It does this for every observed value and picks the trendline (slope and intercept) that gives the minimum of $\sum (y - \hat{y})^2$.

[Figure: trendline with observed values (red crosses) and predicted values (green crosses)]
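The minimization described above can be sketched in plain Python. This is not how the library used later in this post is implemented internally; it is the textbook closed-form solution for a single feature, and the helper names `fit_line` and `sum_squared_errors` as well as the data points are made up for this example.

```python
def fit_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares slope and intercept.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b1 = covariance / variance
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sum_squared_errors(xs, ys, b0, b1):
    """SUM (y - y_hat)^2 for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Invented observations: years of experience and salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [40, 48, 49, 61, 62]

b0, b1 = fit_line(years, salary)
best = sum_squared_errors(years, salary, b0, b1)

# Nudging the slope away from the fitted value makes the error larger.
worse = sum_squared_errors(years, salary, b0, b1 + 1.0)
print(b0, b1, best, worse)
```

The last two numbers printed show that the fitted line has a smaller sum of squared errors than the nudged one.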

Steps to solve the problem.

Step 1: Data Preprocessing

We will follow the same process as we did in the first part of the series: import the libraries, read the CSV file, and take a look at the data.
[Figure: preview of the dataset]

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('../Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

Step 2: Split the data into training and test data

Since there are 30 rows, we will split the data 20:10 (training:test), which is why test_size is set to 1/3 below.

# Split the data into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
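As a quick sanity check of the 20:10 split, the sketch below runs the same call on stand-in data with the same shape as the post's dataset (30 rows, one feature). The arrays `X_demo` and `y_demo` are placeholders invented here, not the real salary data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the post's dataset: 30 rows, 1 feature.
X_demo = np.arange(30, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=0
)
print(X_tr.shape, X_te.shape)  # (20, 1) (10, 1)

# Fixing random_state makes the shuffle reproducible between runs.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=0
)
print(np.array_equal(X_te, X_te2))  # True
```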

Step 3: Feature Scaling

We won’t be feature scaling or normalizing here. Ordinary least squares with a single feature gives the same predictions whether or not the feature is scaled, so we are going to skip this step.

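One way to convince yourself that skipping this step is harmless for this model is to fit it twice, once on the raw feature and once on a standardized copy, and compare the predictions. The sketch below does that on synthetic data (invented numbers, not the real dataset) using scikit-learn's StandardScaler.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the salary data (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 10, size=(30, 1))
y_demo = 25000 + 9000 * X_demo.ravel() + rng.normal(0, 3000, size=30)

# Fit once on the raw feature and once on a standardized copy.
raw_model = LinearRegression().fit(X_demo, y_demo)
scaler = StandardScaler().fit(X_demo)
scaled_model = LinearRegression().fit(scaler.transform(X_demo), y_demo)

X_new = np.array([[2.5], [7.0]])
pred_raw = raw_model.predict(X_new)
pred_scaled = scaled_model.predict(scaler.transform(X_new))

# The coefficients differ, but the predictions agree.
print(raw_model.coef_[0], scaled_model.coef_[0])
print(np.allclose(pred_raw, pred_scaled))
```

Scaling changes the units of the slope, not the line itself, so the predicted salaries come out the same.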
Step 4: Train the model

At this point our data pre-processing is complete, and we will use the library to run the regression on the training data set.

# Fitting the simple linear regression on the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In the above code, we have trained our model and saved it in the variable regressor.
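Once a LinearRegression model is fitted, its learned slope and intercept are available as the `coef_` and `intercept_` attributes. The sketch below fits a model on noise-free synthetic data (the numbers 26000 and 9500 are invented, not taken from the dataset) so that the recovered values are easy to recognize.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free stand-in data: salary = 26000 + 9500 * years.
X_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_demo = 26000 + 9500 * X_demo.ravel()

demo_regressor = LinearRegression()
demo_regressor.fit(X_demo, y_demo)

slope = demo_regressor.coef_[0]        # b1, one entry per feature
intercept = demo_regressor.intercept_  # b0

print(round(slope, 2), round(intercept, 2))
```

On the real data, `regressor.coef_` and `regressor.intercept_` give the trendline that the model learned from the training set.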

Step 5: Predict and compare with the actual test set

Next, we are going to predict the values of the test sample and get the predicted results in variable y_pred.

# Predicting the salary of the test sample
y_pred = regressor.predict(X_test)

Let’s see how the model did.
[Figure: predicted salaries compared with the actual test values]
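A side-by-side comparison like this can also be produced in code. The sketch below is self-contained, so it uses synthetic data in place of Salary_Data.csv (all values are invented). The two metrics, mean absolute error and R², are a common way to summarize the comparison, although the original post does not use them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Salary_Data.csv (30 rows); values are invented.
rng = np.random.default_rng(42)
X_demo = rng.uniform(1, 10, size=(30, 1))
y_demo = 26000 + 9500 * X_demo.ravel() + rng.normal(0, 4000, size=30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=0
)
model = LinearRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Side-by-side view: actual, predicted, and the error for each test row.
for actual, predicted in zip(y_te, y_hat):
    print(f"{actual:10.0f} {predicted:10.0f} {predicted - actual:+9.0f}")

mae = mean_absolute_error(y_te, y_hat)  # average absolute miss
r2 = r2_score(y_te, y_hat)              # 1.0 would be a perfect fit
print(f"MAE: {mae:.0f}, R^2: {r2:.3f}")
```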

Step 6: Plot the data

To better understand the model, let’s plot its trendline and look at two things:

How well does the trendline fit the training data?
How well does the trendline (the predicted values) fit the test data?

plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

[Figure: training set points (red) with the fitted trendline (blue)]

This is how the graph looks for the training data.

A couple of things to note. The scatter function simply plots the (x, y) values on the graph as red dots:

plt.scatter(X_train, y_train, color='red')

The plot function places the points in the same way but joins them with a line, from the first point to the last, and draws no markers by default. You will also notice that for the y-axis we passed the model’s predict(X_train) and not predict(X_test). The model was fitted on the training set, so predict evaluates the learned line (y = mx + c, where m is the slope b1 and c is the intercept), and passing X_train draws that line across the range of the training data.

plt.plot(X_train, regressor.predict(X_train), color='blue')

Plot vs Scatter

The primary difference of plt.scatter from plt.plot is that it can be used to create scatter plots where the properties of each individual point (size, face color, edge color, etc.) can be individually controlled or mapped to data.
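The difference can be seen in a small sketch. The data points are invented, and the `Agg` backend line is only there so the example runs without a display; remove it for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Invented points for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40.0, 48.0, 49.0, 61.0, 62.0])

fig, ax = plt.subplots()

# scatter: marker size (s) and color (c) can vary point by point.
points = ax.scatter(x, y, s=20 * x, c=y, cmap="viridis")

# plot: one style for the whole series, points joined in the order given.
(line,) = ax.plot(x, y, color="blue")

# Call fig.savefig(...) or, with an interactive backend, plt.show() to view it.
```

Here each dot gets its own size and color from the data, while the line drawn by plot has a single color for the whole series.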

The purpose of the above code is to show the trendline against the training data points (red dots). If we wanted to show the predicted values as individual dots instead of a line, we would have used

plt.scatter(X_train, regressor.predict(X_train), color='blue')

In that case the graph would have looked like this.

Here’s how the graph looks for the test data:

# Plot the test set against the trendline
plt.scatter(X_test, y_test, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

[Figure: test set points (red) with the fitted trendline (blue)]

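As explained earlier, the trendline is the line the model learned from the training set, so we draw the same line here. Hence we use

plt.plot(X_train, regressor.predict(X_train), color='blue')

rather than the following, which would only trace the same fitted line over the range of the test set:

plt.plot(X_test, regressor.predict(X_test), color='blue')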

Now let’s create a more complete picture: the training set (red dots), the test set (green dots), and the predicted values (black dots).

# Plot the training set, the test set, and the predictions together
plt.scatter(X_train, y_train, color='red')    # training set
plt.scatter(X_test, y_test, color='green')    # test set (actual values)
plt.scatter(X_test, y_pred, color='black')    # test set (predicted values)

plt.plot(X_train, regressor.predict(X_train), color='blue')  # fitted trendline
plt.title('Salary vs Experience (Training and Test Sets)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

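[Figure: training set (red), test set (green), predicted values (black), and the fitted trendline (blue)]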

So you will notice that all the black dots lie on the trend line because they are created using the trendline equation.
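This can be checked numerically: the predictions are just the fitted line evaluated at the test inputs. The sketch below uses synthetic stand-in data (invented values, not Salary_Data.csv) and compares `predict` with the line computed by hand from `intercept_` and `coef_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real post uses Salary_Data.csv.
rng = np.random.default_rng(1)
X_demo = rng.uniform(1, 10, size=(30, 1))
y_demo = 26000 + 9500 * X_demo.ravel() + rng.normal(0, 4000, size=30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=0
)
model = LinearRegression().fit(X_tr, y_tr)

# predict() is just the fitted line evaluated at the given X values.
by_hand = model.intercept_ + model.coef_[0] * X_te.ravel()
on_the_line = np.allclose(model.predict(X_te), by_hand)
print(on_the_line)  # True
```

The actual test values (the green dots) do not satisfy this equation exactly, which is why they scatter around the line instead of sitting on it.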

That’s it. We just trained a model using Simple Linear Regression, predicted values with it, and compared them against the test data.