# Part 2 Machine Learning Simplelinear Regression

2019-01-22
` 928 words `

` 5 mins read `

In the previous post we learned about data pre-processing. In this post we will review the simplest algorithm - Simple Linear Regression.

**Business Problem**: In this series, we are going to learn about Simple Linear Regression. We will review a data set about `Salary`

and `Experience in age`

. As a data scientist, your job is to help the HR department predict the salary of a person based on his years of experience if he or she accepts the job offer. If you offer too low, the person will not accept the job offer. If you offer too high, then you will be wasting the company’s resources($$).

### Getting the dataset

### What is Simple Linear regression?

If we draw a graph of `salary`

vs `Experience`

, you will see a linear trend.

Based on the graph, it is clear that this is a positive slope. If the co-efficient `b1`

is big, the slope is going to steeper, which means that if there is a small increase in the age, then there will be a big increase in the salary. If the value of `b1`

is small, the slope is going to be more gentle and with change is experience, the salary is going to increase gently.

### How is a trendline determined by a model?

The model tries to find the trendline by determining the least of `Sum of errors`

. In the below diagram, imagine the `red cross`

as observed value and `green cross`

as determined value(model prediction). The algorithm finds the difference between the and squares them. The algorithm does this for all the observed values and determines the slope of trendline which given the minimum of **SUM(y - y ^{^})^{2}**

### Steps to solve the problem.

**Step 1: Data Preprocessing**

We will follow the same process as we did in the first series. We will import the libraries and read the csv file and take a look at the data.

```
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('../Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
```

**Step 2: Split the data into training and test data**Since there are 30 rows, we will divide the data in 20:10(training:test)

```
#Split the data into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=1/3, random_state=0)
```

**Step 3: Feature Scaling**We won’t be feature scaling or normalizing here because the library that will execute the model takes care of it. So we are going to skip this step.**Step 4: Train the model**At this point our data pre-processing is complete and we will use the library to run the regression on the training data set.

```
#Fittiing the Simple linear regression on the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression( )
regressor.fit(X_train, y_train)
```

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
```

In the above code, we have trained our model and saved it in the variable `regressor`

.

**Step 5: Predict and compare with the actual test set**

Next, we are going to predict the values of the test sample and get the predicted results in variable `y_pred`

.

```
# Predicting the salary of the test sample
y_pred = regressor.predict(X_test)
```

Let’s see how did the model predict.

**Step 5: Plot the data**

To better understand, let’s create plot the trendline of our model and look for two things 1. What is the fit of trendline for training data? 2. What is the fit of trendline on test data against predicted values?

```
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
```

Here’s how the graph would like for **Training data**
Couple of things to note. The scatter function just plots the x,y values on the graph as red dots
> plt.scatter(X_train, y_train, color=‘red’)

The plot function also maps the data as blue dots but it joins them via line starting from first mapping to the last. Additionally, you will also notice that in the y-axis, we have passed model function `predict(X_train)`

and **not** `predict(x_test)`

because we want to use the function created using the training set - `y=mx+c`

.
> plt.plot(X_train, regressor.predict(X_train), color=‘blue’)

#### Plot vs Scatter

The primary difference of plt.scatter from plt.plot is that it can be used to create scatter plots where the properties of each individual point (size, face color, edge color, etc.) can be individually controlled or mapped to data.

The purpose of the above code is to show trendline against the training data points(red dots). If we wanted to show predicted line as set plotted `dots`

, we would have used
>plt.scatter(X_train, regressor.predict(X_train), color=‘blue’)

In that case the graph would have looked like this.

Here’s how the graph would like for **Test data**

```
#Plot the data against test set
plt.scatter(X_test, y_test, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
```

As explained earlier, the trend line is what the model was trained against the training data set. Hence > plt.plot(X_train, regressor.predict(X_train), color=‘blue’)

and **not**
> plt.plot(X_test, regressor.predict(X_test), color=‘blue’)

Now let’s create a more complete picture - the training set(RED DOTS), test set(GREEN DOTS) and predicted value (BLACK DOTS).

```
#Plot the data against test set
plt.scatter(X_train, y_train, color='red')
plt.scatter(X_test, y_test, color='green')
plt.scatter(X_test, y_pred, color='black')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
```

So you will notice that all the `black`

dots lie on the trend line because they are created using the trendline equation.

That’s it. We just trained a model using Simple linear regression and predicted the values, then compared it against test data.

```
```

## Related Articles:

- 2019/01/21 Part 1 Machine Learning Data Preprocessing