Tree Based Decision Model for Classification — Practical Implementation

Rodrigo Dutcosky
8 min read · Jun 1, 2020

Hello there!

Welcome to another post from my Practical Implementation Series.

Before anything else, I'll leave the links to the previous posts in this series, covering the algorithms I've implemented so far. If you enjoy this one, please don't forget to check them all out afterwards!

Decision Tree Models

Decision Tree Based Models are supervised algorithms capable of predicting continuous values or classifying data based on a given label.

For classification, the target can be either boolean or made up of multiple labels.

One of my favorite things about this Machine Learning technique is that its practical use could not be closer to the definition of an algorithm itself!

My definition of an algorithm is:

A finite sequence of previously defined instructions set in a logical order with the purpose of problem solving.

Let me give you an example of what I mean by that.

I personally have a pretty good idea of what a good start to the day means for me. So, to classify my mornings as good or bad, all I have to do is pre-define what has to happen for each.

  1. Am I waking up too early? What is early for me?
  2. Is it cold? What do I mean by cold?
  3. Did I get enough hours of sleep?
  4. If I had enough sleep, is waking up early still a bad thing?
  5. Do I have breakfast at home, or will I have to grab something on my way to work? Which of the two do I like better?

You get the idea…

If I feed all this information into a model for training, I'll be able to predict whether I'll have a good or bad morning based on new data entries.

Tree-based decision classifiers work by building a tree that splits all this data into nodes. Every split aims for the best possible separation of the target labels.

How are the Tree nodes created?

Well, if you have a well-structured dataset, you basically just have to call the function. The programming miracle will take care of the rest for you.

If you don't pass any parameters to the function, your model will be created using the defaults. But let me tell you about the most common parameters used with this type of algorithm.

  1. You can select the depth of your tree. This represents how far your tree will go. If you leave this parameter at its default, the model will keep splitting your data into nodes until it can't do it anymore. This is a good parameter to play with when you suspect overfitting, and there are some fancy methods, called pruning, to help you make this choice.
  2. You can control how the samples are distributed when building the tree, like the minimum number of samples (data rows) required to split an internal node or to end up in a leaf.
  3. You can choose which criterion your model will use to distribute/split your data. There's a quick sketch of these parameters right below this list, and I'll tell you a little bit more about the criterion next.
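
A minimal sketch of what calling the function with these parameters looks like (the values here are just illustrative, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

# A few of the most common parameters (illustrative values only)
DTC = DecisionTreeClassifier(
    criterion='gini',       # split criterion: 'gini' (default) or 'entropy'
    max_depth=5,            # how far the tree is allowed to grow
    min_samples_split=10,   # minimum samples required to split an internal node
    min_samples_leaf=5      # minimum samples required to end up in a leaf
)
```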

Data Distribution Methods

Entropy

This criterion calculates the information gain from each feature/value of your dataset. To exemplify, let's assume the model found that waking up early has a high importance for predicting good/bad mornings for me.

I gave my model 1,000 records of information to train on, and it found that 90% of the time I had a good morning, I didn't have to wake up early.

This means the time I have to wake up is very important for making correct predictions. If the model uses this feature as the root node of the tree, the first split will lose the minimum possible amount of information about the target.

The tree will keep creating the following nodes, always asking: where can I split this data next and keep the most target labels together?
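
Just to make the intuition concrete, here's a tiny sketch of how entropy and information gain could be computed by hand for a yes/no feature. This is purely illustrative; scikit-learn does all of this internally:

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array: -sum(p * log2(p))
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(labels, feature_values):
    # Entropy before the split minus the weighted entropy of each branch
    total = entropy(labels)
    weighted = 0.0
    for value in np.unique(feature_values):
        branch = labels[feature_values == value]
        weighted += len(branch) / len(labels) * entropy(branch)
    return total - weighted

# Toy example: good morning (1/0) vs. woke up early (1/0)
good_morning  = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
woke_up_early = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
print(information_gain(good_morning, woke_up_early))  # high gain: the feature separates the target well
```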

Tree-based models are also one (of many) ways to understand how important each feature is for correctly classifying a given target.

I will show you how when we reach the practical part of this post!

Gini

This criterion makes your model split based on the Gini impurity calculated for each feature. The level of impurity can range from 0 to 1.

Impurity is the probability of an element being classified incorrectly if it were labeled at random.

I'll use another example of my morning decisions to simplify it.

Let's say the model is going to calculate the Gini impurity of the weather temperature feature. When I had a good morning, how many times was it cold versus not cold?

I hate cold mornings, ok? So...

  1. The calculation for this feature would be 0 if, 100% of the times I had a good morning, it was not cold.
  2. The calculation would be 0.5 if the values of this feature are equally distributed with respect to the target. Since being cold is only a yes/no feature, that means half the times it was cold I had a good morning, and half the times it was cold I didn't.
  3. The calculation only approaches 1 when there are many classes and the values are spread randomly across them; for a yes/no target like this one, 0.5 is already the worst case.

The Gini impurity is also calculated for all features before any split happens.

This is the default value for the criterion parameter of the function, and from my brief experience with tree-based models, it usually works best.
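
Here's the same kind of intuition-building sketch for the Gini impurity calculation (again, purely illustrative):

```python
import numpy as np

def gini_impurity(labels):
    # 1 - sum(p_i^2): probability of mislabeling a randomly chosen sample
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

print(gini_impurity([1, 1, 1, 1]))  # 0.0  -> pure node
print(gini_impurity([1, 1, 0, 0]))  # 0.5  -> worst case for a yes/no target
print(gini_impurity([0, 1, 2, 3]))  # 0.75 -> approaches 1 with many classes
```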

Let's get to Work

To show you the practical implementation of decision tree classifiers, I picked a small dataset from Kaggle which contains information about 101 animals from a zoo and their respective classes.

You can find this dataset at the link below:

You'll find that the animals' classes are in a different CSV file, so the first thing I'll do is get everything together.
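
My merge looked roughly like the sketch below. I'm assuming the files are named zoo.csv and class.csv and that the class number column links them, so double-check the column names on your copy of the dataset:

```python
import pandas as pd

# Assumed file and column names -- adjust to match your download
zoo = pd.read_csv('zoo.csv')        # one row per animal, with a numeric class_type column
classes = pd.read_csv('class.csv')  # maps each class number to its class name

# Bring the class name onto each animal row
df = zoo.merge(classes[['Class_Number', 'Class_Type']],
               left_on='class_type', right_on='Class_Number')
```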

In the dataset, we have some boolean features mixed in with some continuous features as well.

Now I'm going to separate the data into two arrays: the features that will be used to build the tree, and the labels.

Let's call the DecisionTreeClassifier() function from the sklearn.tree library and create the model object in a variable called DTC.

After that, let's split our dataset: 75% for training and 25% for testing.

Notice I only changed the criterion parameter, because the default is Gini. I'm doing that to compare the accuracy of both.
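
Putting those steps together, the code looks roughly like this (the column names are assumptions based on the merge above):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Features: the animal attributes; labels: the class number (column names assumed)
X = df.drop(columns=['animal_name', 'class_type', 'Class_Number', 'Class_Type'])
y = df['class_type']

# 75% for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Only the criterion is changed; everything else stays on the defaults
DTC = DecisionTreeClassifier(criterion='entropy')
```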

Feel free to explore all the parameters of this function in the official documentation below!

To train the model, we just need to call the good old fit() function. In the next lines of code, I'll also make predictions on the test dataset and print the score using the test labels.
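
In code, that's something like:

```python
# Train the tree on the training split
DTC.fit(X_train, y_train)

# Predict the classes of the test animals and check the accuracy
predictions = DTC.predict(X_test)
print(DTC.score(X_test, y_test))
```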

The decision tree classifier had an accuracy of 92.3% using the entropy split criterion.

To test the model's accuracy using Gini impurity, the only thing I need to do is change the criterion to gini and run the same code again.
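
In other words, something along these lines:

```python
# Same data, same steps -- only the split criterion changes (gini is the default anyway)
DTC = DecisionTreeClassifier(criterion='gini')
DTC.fit(X_train, y_train)
print(DTC.score(X_test, y_test))
```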

Doing that, the Model had an accuracy of 96.5%.

Feature Importances

Remember when I said tree-based models are also good for understanding feature importances?

You can print each feature's importance as calculated by the model through the feature_importances_ attribute:
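
```python
# The fitted model exposes the importances as a plain array
print(DTC.feature_importances_)
```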

But that just gives you the importance percentage of each feature in the dataset, and you can't even see which is which.

I developed a piece of code to make things clearer during the analysis. Feel free to put it in your toolbox.
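
It's just a couple of lines pairing each column name with the importance the model assigned to it (a minimal sketch, assuming X is the features DataFrame and DTC is the fitted model from before):

```python
import pandas as pd

# Pair each feature name with its importance and sort from most to least important
importances = pd.Series(DTC.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```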

Much better. It's crazy to see that seven features had no importance at all in the creation of the model using entropy.

Let's try to raise that 92.3% performance by removing the least important features from the dataset.

So hair, backbone, predator, tail, legs, milk and toothed will be dropped. After that, I'll use a loop to remove the remaining four least important features, one by one, computing the accuracy on every iteration.
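
Here's a sketch of that loop. The order of the remaining features is assumed from the importance ranking above, and your exact numbers may vary with the train/test split:

```python
# Features the model gave zero importance to under the entropy criterion
zero_importance = ['hair', 'backbone', 'predator', 'tail', 'legs', 'milk', 'toothed']

# Remaining low-importance features, assumed ordered from least to most important
next_least = ['domestic', 'catsize', 'feathers', 'fins']

X_train, X_test, y_train, y_test = train_test_split(
    X.drop(columns=zero_importance), y, test_size=0.25)

dropped = []
for feature in next_least:
    dropped.append(feature)
    model = DecisionTreeClassifier(criterion='entropy')
    model.fit(X_train.drop(columns=dropped), y_train)
    accuracy = model.score(X_test.drop(columns=dropped), y_test)
    print(f"dropped {', '.join(dropped)} -> accuracy: {accuracy:.3f}")
```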

Besides dropping all the 0% importance features, it looks like I kept the same performance when I dropped domestic and catsize as well.

Dropping feathers brought the model down to 73% accuracy, but when I also dropped fins the accuracy went back up to 80%.

I made this loop to show that it's not as simple as dropping everything that has no influence on that particular model. The importance of a feature depends on the combination of features you give the model in the first place. Maybe feature X is not relevant in one given set of features, but could become more relevant if features Y and Z weren't there.

But it's all good. There's another way we can boost the accuracy of our entropy-based model.

GridSearchCV()

From sklearn.model_selection you can import GridSearchCV(), a very powerful function to have on hand when training models in Python.

GridSearch is used when you want to find the best parameters for a model. You need to pass the model you'll use and a dict object with the parameters you want to test. It's basically a loop, but one you don't need to waste time coding yourself.
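
A sketch of how that looks for our tree (the parameter grid is just an example of values to try):

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Back to the full feature set for this experiment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Candidate values to try for each parameter (illustrative grid)
param_grid = {'max_depth': list(range(2, 11)),
              'min_samples_split': [2, 5, 10]}

# 3-fold cross-validation, since the dataset is tiny
grid = GridSearchCV(DecisionTreeClassifier(criterion='entropy'), param_grid, cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)            # best combination found
print(grid.score(X_test, y_test))   # accuracy of the refitted best model
```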

By calling best_params_, I found out the best value for max_depth is 6. I ended up raising the accuracy to 96.15%. Very close to the Gini-based model.

Before you Go

In this post, I explored classification with DecisionTreeClassifier() only. Although we trained a couple of models and found the best parameters using GridSearch, it was still basically a single tree.

What if I told you there are machine learning algorithms capable of building not one, but multiple trees at a time, and picking the one with the best predictive performance?

So instead of using this…

We could go for that:

Stay tuned!
