# Part 7 Decision tree regression or classification using python

2019-03-03

In the previous post, we learnt about Support vector regression. In this post, we will see a new way of deciphering information using a simple format of traversing conditions.

**Business Goal**: Can you spot the king? The people of *Falkland* are scared. Their king disguises himself as a common man and roams among them to learn about his kingdom and see whether his policies are working. When the king is disguised, the common people don’t recognize him. If they accidentally mistreat the disguised king, they get punished. Can you help the people of *Falkland* spot the king?

### How to get dataset?

### What is a “Decision Tree”?

A Decision tree builds regression or classification models in the form of a tree structure. It is a flow of ‘yes’ or ‘no’ questions, which cascades downward like an upside-down tree. For example, given a set of independent variables or features about a person, can we find out if the person is healthy?

### Parts of decision Tree

- Each decision point is called a **Node**. Ex - **Age < 30**
- Each connector is called an **Edge**.
- Each node which does not have any subnode is called a **Leaf**. Ex - **Fit** or **Unfit!**.
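To make the three parts concrete, here is a minimal sketch of a tree written as nested Python dictionaries. The questions (`age_under_30`, `eats_lots_of_pizza`, `exercises_in_morning`) and the `classify` helper are hypothetical names invented for illustration; only the **Fit** and **Unfit!** leaves come from the example above.

```python
# A hypothetical tree for the "is this person fit?" example.
# Nodes are dicts holding a question, edges are the "yes"/"no" keys,
# and leaves are plain strings.
tree = {
    "question": "age_under_30",
    "yes": {
        "question": "eats_lots_of_pizza",
        "yes": "Unfit!",
        "no": "Fit",
    },
    "no": {
        "question": "exercises_in_morning",
        "yes": "Fit",
        "no": "Unfit!",
    },
}


def classify(node, person):
    """Follow one edge per node until a leaf (a string) is reached."""
    while isinstance(node, dict):
        answer = "yes" if person[node["question"]] else "no"
        node = node[answer]
    return node


person = {
    "age_under_30": True,
    "eats_lots_of_pizza": False,
    "exercises_in_morning": False,
}
print(classify(tree, person))
```

Traversing the tree is just a loop that picks one edge at every node, which is why predictions from a trained tree are cheap.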

### How is the tree built?

To build a tree, we need to start with an *Independent Feature* as the root node. The possible attributes or unique values of that feature form the edges. Once the first level of the tree is completed, attach another feature node at the end of each edge and traverse deeper. Once you have exhausted all the features, you will arrive at the dependent value or result.

### Can we just start with any random feature as the root node?

This is the million $$$ question. This is the meat of the whole algorithm. Let’s look at the business problem that the people of *Falkland* are facing. We need to come up with a way to spot the king when he is disguised, so that the common man does not mistreat him accidentally and get punished in return. Here’s the data that we have collected about people leaving the castle.

Ok! So we have the data, but how do we find out which feature will be the root node?

Going back to our previous post on **Backward Elimination**, we can gather that the root node should be the feature which is the most important in making the decision. To find the most important feature, we will align each independent feature with the dependent feature (*Is_King*).

If we look at the above mapping, we will see that the **Gold_Tooth** feature is right most of the time in predicting the king, followed by **Castle**, as it has the least number of false positives.
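The alignment of each feature with *Is_King* could be sketched with pandas as follows. The rows below are made up purely for illustration (they are not the data collected in this post); only the column names follow the features discussed here.

```python
import pandas as pd

# Made-up observations, NOT the post's data: one row per person leaving the castle.
data = pd.DataFrame({
    "Gold_Tooth": [1, 0, 1, 0, 0, 1, 0, 0],
    "Castle":     [1, 1, 1, 0, 1, 1, 0, 0],
    "Greedy":     [1, 1, 0, 0, 1, 1, 0, 1],
    "Slow":       [0, 1, 1, 1, 0, 1, 1, 0],
    "Is_King":    [1, 0, 1, 0, 0, 1, 0, 0],
})

features = [c for c in data.columns if c != "Is_King"]

# agreement: share of rows where the feature value equals Is_King.
# false_positives: the feature says 1 but the person is not the king.
summary = pd.DataFrame(
    {
        "agreement": [(data[f] == data["Is_King"]).mean() for f in features],
        "false_positives": [
            ((data[f] == 1) & (data["Is_King"] == 0)).sum() for f in features
        ],
    },
    index=features,
).sort_values("agreement", ascending=False)

print(summary)
```

In this toy table the feature at the top of the ranking is the natural candidate for the root node, and two features can easily tie, which is the situation discussed next.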

Well, that’s good to know, but I noticed that you did not talk about the last two features - **Greedy** and **Slow**.

Yes, the distinction between the two is difficult to figure out. Both the **Greedy** and **Slow** features have an equal number of false positives. To understand which feature is more important than the other, we need to understand **Data Entropy**.

### What is Data Entropy?

Entropy measures how mixed, or uncertain, the outcomes in our data are. Imagine if the king never left the castle: every row that we collected would show **Is_King** as **0**, there would be nothing left to guess, and the entropy would be **0**. In our case, the entropy is treated as **1**, the highest value for a yes/no outcome, because anybody could be the king. If we just had **Castle** as the feature, predicting the king would be difficult without another piece of information.
>So in simple terms, **Entropy** is how many pieces of data (*independent features*) are required to guess the *dependent variable* - *Is_King*.
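As a rough sketch, and assuming the entropy meant here is the usual Shannon entropy used by many decision tree algorithms, it can be computed from the class counts. The helper names `entropy` and `information_gain` and the sample labels are illustrative assumptions, not part of the original post.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy of the labels minus the weighted entropy after splitting on a feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Made-up labels for illustration only.
never_king = [0, 0, 0, 0]   # the king never left the castle: no uncertainty
coin_flip = [0, 1, 0, 1]    # half of the rows are the king: maximum uncertainty

print(entropy(never_king), entropy(coin_flip))
# A feature that copies the label removes all uncertainty;
# a feature unrelated to the label removes none.
print(information_gain(coin_flip, [0, 1, 0, 1]))
print(information_gain(coin_flip, [0, 0, 1, 1]))
```

When two features have the same number of false positives, comparing their information gain is one common way to decide which one should sit higher in the tree.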

To explain further, let’s say that instead of starting with **Gold_Tooth** as the root node, we start with **Castle**. We will see that we are able to find the king only **3/10** times. On top of that, the left side gives very poor results: just **1/5** or **20%**.

There is another problem with the above tree. It is overcomplicated and **overfitted**. If we get new data, the accuracy of our model could fall drastically.

Going back to our learning in the earlier post, the simpler model should be preferred over the complicated model to avoid overfitting.

### How to avoid overfitting in decision Tree?

Just remember the 3 golden rules to avoid overfitting:

- Use a smaller number of data points to build the tree. Ex - 10% of the data points is a good place to start when building a generic model.
- Do not go overboard with the depth of the tree. The depth should only be increased if it brings a significant improvement in the prediction.
- Stop splitting if the number of data points at a split is less than 5% of the data.
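The depth rule and the 5% rule map fairly directly onto scikit-learn parameters. The sketch below uses scikit-learn's bundled Iris data only as a stand-in, and the value chosen for `min_impurity_decrease` is an arbitrary illustration of "significant improvement", not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth caps how deep the tree may grow (the depth rule).
# min_samples_split=0.05 is read as a fraction: a node holding fewer than
# 5% of the training rows is not split any further (the 5% rule).
# min_impurity_decrease asks every split to improve purity by at least
# this amount; 0.01 is an arbitrary value chosen for illustration.
model = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=0.05,
    min_impurity_decrease=0.01,
    random_state=0,
)
model.fit(X, y)

print(model.get_depth(), model.get_n_leaves())
```

Tightening any of these three parameters gives a smaller tree; which values work best depends on the data and is usually checked on a held-out test set.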

Here’s a refined version of the tree.

### Would this model work on non-categorical or continuous values?

Absolutely!! The splitting rules would still apply as I mentioned above.

Each box created by the splits above is a leaf node. Imagine we wanted to find the dependent variable **Y** for a point whose independent variables **X1** and **X2** are **10** and **150**: it would land in the first leaf node, which gives **300.5**.

I get why it landed in the first leaf node position, but where did we get the value 300.5?

The value 300.5 is the average of all the data points in that box.

Pay attention and read the previous two lines again. They will help you understand why we need to divide the data into different nodes and leaves. If you do not have splits, then the only option is to take the average of **ALL** the data points!! The accuracy would be nowhere close to your expectation, and the prediction would be the same for all values of X1 and X2.
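Here is a small sketch of that idea with made-up numbers (chosen so that one box averages to 300.5; they are not the data behind the figure). A tree with a single split predicts the mean of the training targets that fall in the same leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: columns are X1 and X2, with two clearly separated groups of X1.
X = np.array([[5, 100], [10, 150], [12, 120], [50, 100], [55, 150], [60, 130]])
y = np.array([300.0, 301.0, 300.5, 900.0, 950.0, 1000.0])

# A single split (max_depth=1) gives two leaves.
model = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

# Prediction for X1=10, X2=150 versus the mean of the first group of targets.
prediction = model.predict([[10, 150]])[0]
first_leaf_mean = y[:3].mean()
print(prediction, first_leaf_mean)

# Without any split, the only possible answer is the mean of all targets.
print(y.mean())
```

The more useful splits the tree finds, the closer each leaf average gets to the points inside it, which is also why an unlimited number of splits ends up memorising the training data.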

### Python Implementation

We are going to take a standard dataset called the Iris dataset.

>“The Iris flower dataset or Fisher’s Iris dataset is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in his 1936 paper ‘The use of multiple measurements in taxonomic problems’ as an example of linear discriminant analysis.” - Wikipedia

In layman’s terms, it is a set of data points about Iris flowers, with the length and the width of the sepals and petals for 3 varieties.

**Step 1:** Get the common imports

```
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
dataset = pd.read_csv("iris.csv")
```
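If `iris.csv` is not at hand, a similar DataFrame could be built from the copy of the dataset bundled with scikit-learn. This is only a sketch: the column names and the `Iris-` prefix are chosen to mimic the CSV used in this post, and a few measurements in scikit-learn's copy may differ slightly from that file.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Column names chosen to match the CSV used in this post (an assumption).
columns = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
dataset = pd.DataFrame(iris.data, columns=columns)

# Turn the numeric targets back into names such as "Iris-setosa".
dataset["species"] = ["Iris-" + str(name) for name in iris.target_names[iris.target]]

print(dataset.head())
```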

**Step 2:** Identify the missing data

```
dataset.isnull().any()
```

```
sepal-length False
sepal-width False
petal-length False
petal-width False
species False
dtype: bool
```

**Step 3:** Describe the data and identify the data types

```
dataset.describe()
```

| | sepal-length | sepal-width | petal-length | petal-width |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |

```
dataset.dtypes
```

```
sepal-length float64
sepal-width float64
petal-length float64
petal-width float64
species object
dtype: object
```

**Step 4:** Load the Iris data and create the X and Y variables

```
# Features: the four measurement columns; target: the species column
X = dataset.iloc[:, 0:4].values
Y = dataset.iloc[:, 4]
```

**Step 5:** Plot the data

```
##Get the dataset for each of the three species
setosa = dataset[dataset['species'] == 'Iris-setosa']
versicolor = dataset[dataset['species'] == 'Iris-versicolor']
virginica = dataset[dataset['species'] == 'Iris-virginica']
#Create one figure with two side-by-side subplots, 21 by 10 inches overall
#(a separate plt.figure() call would only add an extra empty figure)
fig, ax = plt.subplots(1, 2, figsize=(21, 10))
#Plot each species using Sepal length and width on x-y axis
setosa.plot(x="sepal-length", y="sepal-width", kind="scatter", ax=ax[0], label='setosa', color='r')
versicolor.plot(x="sepal-length", y="sepal-width", kind="scatter", ax=ax[0], label='versicolor', color='b')
virginica.plot(x="sepal-length", y="sepal-width", kind="scatter", ax=ax[0], label='virginica', color='g')
#Plot each species using Petal length and width on x-y axis
setosa.plot(x="petal-length", y="petal-width", kind="scatter", ax=ax[1], label='setosa', color='r')
versicolor.plot(x="petal-length", y="petal-width", kind="scatter", ax=ax[1], label='versicolor', color='b')
virginica.plot(x="petal-length", y="petal-width", kind="scatter", ax=ax[1], label='virginica', color='g')
#Give each subplot its title and y-axis label
ax[0].set(title='Sepal comparison', ylabel='sepal-width')
ax[1].set(title='Petal comparison', ylabel='petal-width')
ax[0].legend()
ax[1].legend()
#Show the plot
plt.show()
```

**Step 6:** Encode the value of Flower types

The values of the dependent variable need to be encoded as numbers because they are categorical values.

```
from sklearn.preprocessing import LabelEncoder
labelEncoder_y = LabelEncoder()
Y = labelEncoder_y.fit_transform(Y)
```
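To see which number stands for which flower, the fitted encoder can be inspected. The short list of species below is a made-up sample used only to keep the sketch self-contained.

```python
from sklearn.preprocessing import LabelEncoder

# A small made-up sample of species labels.
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the original labels in sorted order; the position is the code.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform turns the numbers back into the flower names.
print(list(encoder.inverse_transform(codes)))
```

Keeping the encoder object around is useful later, because predictions come back as numbers and usually need to be translated back into species names.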

**Step 7:** Split the data into training and test sets

```
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.2, random_state=0)
```

**Step 8:** Train the Decision Tree model

```
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(x_train, y_train)
```

```
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=0, splitter='best')
```

**Step 9:** Predict and score the model

```
y_predict=regressor.predict(x_test)
print(regressor.score(x_test,y_test))
```

```
1.0
```

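Wow! Did we just predict that our model is correct 100% of the time?

For a regressor, `score` returns the R² coefficient of determination rather than classification accuracy, and a value of 1.0 means every test point was predicted exactly. The reason the score is perfect is that our model is too complex: we did not define the maximum depth of the tree, and hence we broke a cardinal rule. Let’s take a look at the created tree.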

```
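# io.StringIO replaces sklearn.externals.six, which recent scikit-learn releases no longer ship
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
# Write the fitted tree in Graphviz DOT format into an in-memory buffer
dot_data = StringIO()
export_graphviz(regressor, out_file=dot_data)
# Render the DOT text to a PNG and display it in the notebook
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())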
```

As you can see, since we did not provide a maximum depth for the tree, it created a complex tree of 6 layers, and hence our model gets a perfect score. This means that the model is an **overfitted model**.
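If Graphviz or `pydotplus` is not installed, the same kind of structure can be inspected as plain text with `export_text`. This sketch refits an unrestricted tree on scikit-learn's bundled Iris data (the whole dataset rather than the training split above), so the exact tree may differ from the one in the picture.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor, export_text

iris = load_iris()

# No max_depth: the tree keeps splitting until every leaf is pure.
model = DecisionTreeRegressor(random_state=0).fit(iris.data, iris.target)

feature_names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
text = export_text(model, feature_names=feature_names)

print("depth:", model.get_depth())
print(text)
```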

Let’s fix this by creating a simpler model.

```
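#Creating a simpler model that is at most 3 levels deep by setting max_depth=3
#criterion and splitter are left at their defaults (squared error and 'best');
#the 'mse' spelling is not accepted by newer scikit-learn releases
regressor = DecisionTreeRegressor(random_state=0, max_depth=3, min_samples_split=3, min_samples_leaf=2)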
regressor.fit(x_train, y_train)
y_predict=regressor.predict(x_test)
print(regressor.score(x_test,y_test))
```

```
0.9739827477382705
```

As you can see, the model is no longer overfitted and still gives us a pretty good score of 0.974 (about 97.4%).

Let’s look at the decision tree now.

```
# io.StringIO replaces sklearn.externals.six, which recent scikit-learn releases no longer ship
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
# Write the simpler tree in Graphviz DOT format into an in-memory buffer
dot_data = StringIO()
export_graphviz(regressor, out_file=dot_data)
# Render the DOT text to a PNG and display it in the notebook
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
```

![tree_fix.PNG](/resources/img/decision/tree_fix.PNG)
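Since the species is a category rather than a quantity, the same workflow could also be sketched with `DecisionTreeClassifier` (imported in Step 1) and scored with plain accuracy. This is an alternative to the regressor used above, not a replacement for it, and it uses scikit-learn's bundled Iris data to stay self-contained, so the score may differ from the numbers in this post.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Same limits as the simpler regressor: shallow tree, no tiny splits or leaves.
classifier = DecisionTreeClassifier(
    max_depth=3, min_samples_split=3, min_samples_leaf=2, random_state=0
)
classifier.fit(x_train, y_train)

# Accuracy is the share of test flowers whose species was predicted correctly.
y_predict = classifier.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)
```

With a classifier, `score` and `accuracy_score` both report the fraction of correct predictions, which is usually easier to interpret for a problem like this than R².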

So keep climbing the tree of success with this DecisionTree regression model. In the next part of the series, we will see how to combine many decision trees into a model called Random forest regression.

## Related Articles:

- 2019/02/16 Part 6 Support Vector Regression
- 2019/02/11 Part 5 Machine Learning Backward Elimination
- 2019/02/03 Part 4 Machine Learning Multiple Regression
- 2019/01/27 Part 3 Machine Learning Understanding P Value
- 2019/01/22 Part 2 Machine Learning Simple Linear Regression