Part 7: Decision tree regression or classification using Python

In the previous post, we learnt about Support vector regression. In this post, we will see a new way of deciphering information using a simple format of traversing conditions.

Business Goal: Can you spot the king? The people of Falkland are scared. Their king disguises himself as a common man and roams among them to learn about his kingdom and see whether his policies are working. When the king is disguised, the common people don’t recognize him, and if they accidentally mistreat him, they get punished. Can you help the people of Falkland spot the king?

How to get the dataset?

What is a “Decision Tree”?

A decision tree builds regression or classification models in the form of a tree structure. It is a flow of ‘yes’ or ‘no’ questions that cascades downward like an upside-down tree. For example, given a set of independent variables or features about a person, can we find out whether the person is healthy?

Parts of a decision tree

Each decision point is called a Node. Ex - Age < 30
Each connector is called an Edge.
Each node which does not have any subnode is called a Leaf. Ex - Fit or Unfit.
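
The three parts above can be sketched in plain Python. This is only an illustration: the Age < 30 test and the Fit/Unfit leaves come from the text, while the `exercises` feature and the shape of the tree are assumptions made up for the example.

```python
# A decision node is a dict holding a question and two edges ("yes"/"no");
# a leaf is just a string holding the final answer.
# NOTE: the "exercises" feature and the layout are hypothetical.
tree = {
    "question": lambda person: person["age"] < 30,        # node: Age < 30
    "yes": "Fit",                                          # leaf
    "no": {
        "question": lambda person: person["exercises"],   # node
        "yes": "Fit",                                      # leaf
        "no": "Unfit",                                     # leaf
    },
}


def predict(node, person):
    """Follow the edges from the root until a leaf is reached."""
    while isinstance(node, dict):
        edge = "yes" if node["question"](person) else "no"
        node = node[edge]
    return node


print(predict(tree, {"age": 25, "exercises": False}))  # Fit
print(predict(tree, {"age": 45, "exercises": False}))  # Unfit
```

Every prediction is a walk from the root to a leaf, so the cost of a prediction grows with the depth of the tree, not with the number of training rows.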

How is the tree built?

To build a tree, we start with an independent feature as the root node. The possible attributes or unique values of that feature form the edges. Once the first level of the tree is completed, attach another feature node at the end of each edge and traverse deeper. Once you have exhausted all the features, you will arrive at the dependent value or result.

Can we just start with any random feature as the root node?

This is the million-dollar question, and it is the meat of the whole algorithm. Let’s go back to the problem that the people of Falkland are facing. We need a way to spot the king when he is disguised, so that the common man does not mistreat him accidentally and get punished in return. Here’s the data that we have collected about people leaving the castle.

Ok! So we have the data, but how do we find out which feature will be the root node?

Going back to our previous post on Backward Elimination, we can gather that the root node should be the feature that is most important in making the decision. To find the most important feature, we will align each independent feature with the dependent feature (Is_King).

If we look at the above mapping, we will see that the Gold_Tooth feature is right most of the time in predicting the king, followed by Castle, as it has the fewest false positives.

Well, that’s good to know, but I noticed that you did not talk about the last two features: Greedy and Slow.

Yes, the distinction between the two is difficult to figure out. Both the Greedy and Slow features have an equal number of false positives. To understand which feature is more important than the other, we need to understand Data Entropy.

What is Data Entropy?

Entropy measures how mixed, or uncertain, the outcome is in the data we collected. Imagine if the king never left the castle: all the information that we collected would show Is_King as 0, there would be nothing left to guess, and the entropy would be 0. In our case, the entropy is 1, the maximum for a yes/no outcome, because anybody could be the king. If we just had Castle as the feature, predicting the king would be difficult without another piece of information.

So, in simple terms, entropy is how many pieces of information (independent features) are required to guess the dependent variable, Is_King.
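
To make the idea concrete, here is a minimal sketch of how entropy, and the drop in entropy after splitting on a feature (usually called information gain), can be computed. The rows below are hypothetical and are not the data collected in this post; the helper names `entropy` and `information_gain` are also just illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of labels minus the weighted entropy after splitting on feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical rows, NOT the data collected in this post.
is_king = [1, 1, 1, 1, 0, 0, 0, 0]
gold_tooth = [1, 1, 1, 0, 0, 0, 0, 0]  # agrees with is_king on 7 of 8 rows
greedy = [1, 0, 1, 0, 1, 0, 1, 0]      # tells us nothing about is_king

print(entropy(is_king))                       # 1.0, a 50/50 mix
print(entropy([0, 0, 0, 0]))                  # 0.0, nothing left to guess
print(information_gain(gold_tooth, is_king))  # large drop in entropy
print(information_gain(greedy, is_king))      # 0.0, no drop at all
```

Under these assumptions, the feature with the larger information gain is the better candidate for the root node, which is one way to break a tie like the one between Greedy and Slow.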

To explain further, let’s say that instead of starting with Gold_Tooth as the root node, we start with Castle. We will see that we are able to find the king only 3/10 times. On top of that, the left side gives very poor results: just 1/5, or 20%.

There is another problem with the above tree: it is overcomplicated and overfitted. If we get new data, the accuracy of our model could fall drastically.

Going back to what we learnt in the earlier post, a simpler model should be preferred over a complicated one to avoid overfitting.

How to avoid overfitting in a decision tree?

Just remember the 3 golden rules to avoid overfitting:

Use a smaller number of data points to build the tree. Ex - 10% of the data points is a good place to start for building a generic model.
Do not go overboard with the depth of the tree. The depth should only be increased if there is a significant improvement in the prediction.
Stop splitting if the number of data points at the split is less than 5% of the total.
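
One possible way to express these rules in scikit-learn is sketched below. Mapping the rules onto `train_size`, `max_depth` and `min_samples_split` is my assumption, not something the rules prescribe, and the bundled Iris data from `load_iris` is used only so that the snippet runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Rule 1: build the tree from a small sample (10% here, as suggested above).
X_small, X_rest, y_small, y_rest = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0
)

model = DecisionTreeClassifier(
    max_depth=3,             # Rule 2: cap the depth of the tree
    min_samples_split=0.05,  # Rule 3: a float is read as a fraction of the samples
    random_state=0,
)
model.fit(X_small, y_small)

print(model.get_depth())            # never more than 3
print(model.score(X_rest, y_rest))  # accuracy on the rows the tree did not see
```

Scoring on the rows that were held back gives a rough idea of how the simplified tree behaves on new data.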

Here’s a refined version of the tree.

Would this model work on non-categorical or continuous values?

Absolutely!! The splitting rules would still apply as I mentioned above.

Each box produced by the splits above is a leaf node. Imagine we wanted to find the dependent variable Y for a point whose independent variables X1 and X2 are 10 and 150: it would land in the first leaf node, and the prediction would be 300.5.

I get why it landed in first leaf node position but where did we get value 300.5?
The value 300.5 is the average of all the data points in that box.

Pay attention and read the previous two lines again. They will help you understand why we need to divide the data into different nodes and leaves. If you do not have splits, then the only option is to take the average of ALL the data points!! The accuracy would be nowhere close to your expectation, and the prediction would be the same for all values of X1 and X2.
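
The "average of the box" idea can be checked with a small sketch. The six training rows below are hypothetical, chosen so that the three low-X1 rows average 300.5; `max_depth=1` is used so that the tree makes a single split and has exactly two leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data: two features (X1, X2) and a target Y.
X = np.array([[5, 100], [12, 160], [8, 140], [40, 120], [45, 180], [50, 90]])
y = np.array([290.0, 310.0, 301.5, 620.0, 640.0, 600.0])

# One split only, so the tree ends up with exactly two leaves (boxes).
tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

new_point = np.array([[10, 150]])  # X1 = 10, X2 = 150
prediction = tree.predict(new_point)[0]

# Find the training rows that fall in the same leaf as the new point.
same_leaf = tree.apply(X) == tree.apply(new_point)[0]
print(prediction, y[same_leaf].mean())  # the two numbers match

# With no splits at all, every input would get the global average instead.
print(y.mean())
```

The prediction for the new point is the mean of the training targets that share its leaf, which is why a tree with no splits can only ever return one number.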

Python Implementation

We are going to use a standard dataset called the Iris dataset.

“The Iris flower dataset or Fisher’s Iris dataset is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in his 1936 paper ‘The use of multiple measurements in taxonomic problems’ as an example of linear discriminant analysis.” - Wikipedia

In layman’s terms, it is a set of data points about Iris flowers, where we have the length and the width of the sepals and petals for 3 varieties.

Step 1: Get the common imports

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
dataset = pd.read_csv("iris.csv")

Step 2: Identify the missing data

dataset.isnull().any()

sepal-length    False
sepal-width     False
petal-length    False
petal-width     False
species         False
dtype: bool

Step 3: Describe the data and identify the data types

dataset.describe()

	sepal-length	sepal-width	petal-length	petal-width
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.054000	3.758667	1.198667
std	0.828066	0.433594	1.764420	0.763161
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

dataset.dtypes

sepal-length    float64
sepal-width     float64
petal-length    float64
petal-width     float64
species          object
dtype: object

Step 4: Load the Iris data and create the X and Y variables

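# Features: every row, first four columns (sepal/petal length and width)
X = dataset.iloc[:, 0:4].values
# Target: every row, fifth column (species)
Y = dataset.iloc[:, 4]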

Step 5: Plot the data

# Get the rows for each of the three species
setosa = dataset[dataset['species'] == 'Iris-setosa']
versicolor = dataset[dataset['species'] == 'Iris-versicolor']
virginica = dataset[dataset['species'] == 'Iris-virginica']

# Create one figure of size 21 by 10 with two side-by-side axes
# (no separate plt.figure() call, which would only open an extra empty figure)
fig, ax = plt.subplots(1, 2, figsize=(21, 10))

# Plot each species using sepal length and width on the x-y axes
setosa.plot(x="sepal-length", y="sepal-width", kind="scatter", ax=ax[0], label='setosa', color='r')
versicolor.plot(x="sepal-length", y="sepal-width", kind="scatter", ax=ax[0], label='versicolor', color='b')
virginica.plot(x="sepal-length", y="sepal-width", kind="scatter", ax=ax[0], label='virginica', color='g')

# Plot each species using petal length and width on the x-y axes
setosa.plot(x="petal-length", y="petal-width", kind="scatter", ax=ax[1], label='setosa', color='r')
versicolor.plot(x="petal-length", y="petal-width", kind="scatter", ax=ax[1], label='versicolor', color='b')
virginica.plot(x="petal-length", y="petal-width", kind="scatter", ax=ax[1], label='virginica', color='g')

# Give each plot its title and y-axis label
ax[0].set(title='Sepal comparison', ylabel='sepal-width')
ax[1].set(title='Petal comparison', ylabel='petal-width')
ax[0].legend()
ax[1].legend()

# Show the plot
plt.show()

Step 6: Encode the values of the flower types
The values of the dependent variable need to be encoded as numbers because they are categorical.

from sklearn.preprocessing import LabelEncoder
labelEncoder_y = LabelEncoder()
Y = labelEncoder_y.fit_transform(Y)
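
To see what the encoder does, here is a small standalone sketch on a hand-written list of species names (the list is illustrative, not read from iris.csv).

```python
from sklearn.preprocessing import LabelEncoder

species = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

print(list(encoder.classes_))  # the distinct labels, sorted alphabetically
print(list(codes))             # [0, 2, 1, 0]: position of each label in classes_
print(list(encoder.inverse_transform([0, 1, 2])))  # numbers back to names
```

Keeping the fitted encoder around is useful, because `inverse_transform` lets you turn numeric predictions back into species names.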

Step 7: Split the data into training and test sets

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.2, random_state=0)

Step 8: Train the Decision Tree model

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(x_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=0, splitter='best')

Step 9: Predict and score the model

y_predict=regressor.predict(x_test)

print(regressor.score(x_test,y_test))

1.0

Wow! Did our model just get a perfect score of 1.0 on the test set? Note that for a DecisionTreeRegressor, score returns the R² coefficient rather than classification accuracy.
The score is this high because our model is too complex: we did not define the maximum depth of the tree, and hence we broke a cardinal rule. Let’s take a look at the created tree.

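from io import StringIO  # sklearn.externals.six was removed from newer scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus  # needs the Graphviz binaries installed

# Write the fitted tree in DOT format into an in-memory buffer
dot_data = StringIO()
export_graphviz(regressor, out_file=dot_data)

# Convert the DOT text to a PNG and display it in the notebook
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())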

As you can see, since we did not provide a maximum depth for the tree, it created a complex tree of 6 layers, and hence our model is getting a perfect score. This suggests that the model is overfitted.
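
A common way to check for overfitting is to compare the score on the training data with the score on the test data: a large gap between the two is a warning sign. The sketch below does this with a DecisionTreeClassifier, so the numbers are plain accuracy. It assumes scikit-learn's bundled copy of the Iris data (`load_iris`) instead of iris.csv, so the results may differ slightly from the ones in this post.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data instead of iris.csv.
X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = {}
for name, depth in [("unlimited", None), ("max_depth=3", 3)]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(x_train, y_train)
    results[name] = (
        clf.get_depth(),
        accuracy_score(y_train, clf.predict(x_train)),  # rows the tree has seen
        accuracy_score(y_test, clf.predict(x_test)),    # rows the tree has not seen
    )

for name, (depth, train_acc, test_acc) in results.items():
    print(f"{name}: depth={depth} train={train_acc:.3f} test={test_acc:.3f}")
```

On a dataset as small and clean as Iris the gap may be tiny, so treat this as a way of looking at the problem, not as proof either way.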

Let’s fix this by creating a simpler model.

# Creating a simpler model that is at most 3 levels deep by setting max_depth=3
# (criterion and splitter are left at their defaults; 'mse' was renamed 'squared_error' in later scikit-learn releases)
regressor = DecisionTreeRegressor(random_state=0, max_depth=3, min_samples_split=3, min_samples_leaf=2)
regressor.fit(x_train, y_train)

y_predict=regressor.predict(x_test)

print(regressor.score(x_test,y_test))

0.9739827477382705

As you can see, the model is no longer overfitted and still gives us a pretty good score of about 0.974 (this is R², since we are scoring a regressor).
Let’s look at the decision tree now.

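from io import StringIO  # sklearn.externals.six was removed from newer scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus  # needs the Graphviz binaries installed

# Write the simpler tree in DOT format into an in-memory buffer
dot_data = StringIO()
export_graphviz(regressor, out_file=dot_data)

# Convert the DOT text to a PNG and display it in the notebook
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())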

![tree_fix.PNG](/images/decision/tree_fix.PNG)
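
If Graphviz or pydotplus is not available, scikit-learn can also print a tree as plain text with `export_text`. The sketch below is a standalone example on the bundled Iris data with a classifier limited to `max_depth=2`; it is not the regressor trained above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# export_text returns the tree as an indented string of if/else rules.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one node, and the lines starting with `class:` are the leaves.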

So keep climbing the tree of success with this decision tree regression model. In the next post, we will see how to combine many decision trees into a model called Random forest regression.