A Decision Tree is one of the simpler yet most powerful machine learning algorithms used for classification and regression. In this blog post, I’ll go through the basic concepts behind Decision Trees and walk through Python code that uses a decision tree to create a simple Netflix-style movie recommendation system.
First, a decision tree is pretty much what it sounds like: a tree-like graph that represents decisions with nodes and edges. A decision tree starts with a simple question, for example, “Should I bring an umbrella with me?”, which depends on whether it will rain later. Figure 1 below shows a simple decision tree for bringing an umbrella (or not) given the chance of rain later.
Because decisions hinge on unknown variables (in this case, whether it will rain), a decision tree usually deals with probabilities.
From here, the next step is to assign a probability to each chance node and a weighted score to each end node so we can use the tree to make an informed decision. Figure 2 shows a 70% chance of rain and a 30% chance of no rain. In the case of rain, bringing an umbrella benefits you, which gives a score of +1, while not bringing it will likely leave you wet, a penalty score of -1. In the case of no rain, carrying the umbrella around is an inconvenience with a score of -1, while not bringing it is a “break-even” situation with a score of 0.
To calculate the value of each decision, weight each outcome’s score by its probability and sum along that decision’s branches:
With umbrella:
Rain: 1 * 0.7 = 0.7
No rain: -1 * 0.3 = -0.3
0.7 + (-0.3) = 0.4
Without umbrella:
Rain: -1 * 0.7 = -0.7
No rain: 0 * 0.3 = 0
-0.7 + 0 = -0.7
With an expected value of 0.4 versus -0.7, it is quite clear that bringing an umbrella along when there is a 70% chance of rain is the wise decision.
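To make the arithmetic concrete, here is a minimal sketch in Python that computes the same expected values. The probabilities and scores are the ones from figure 2; the variable names are mine.

# Chance of rain and the outcome scores from figure 2.
p_rain = 0.7
p_no_rain = 1 - p_rain

# Score of each outcome for each decision: (score if rain, score if no rain).
decisions = {
    "with umbrella": (1, -1),
    "without umbrella": (-1, 0),
}

# Expected value of a decision = sum of each outcome's score times its probability.
for decision, (score_rain, score_no_rain) in decisions.items():
    expected = score_rain * p_rain + score_no_rain * p_no_rain
    print(f"{decision}: {expected:.1f}")

# with umbrella: 0.4
# without umbrella: -0.7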
Decision Tree in Machine Learning
In machine learning, the algorithm for building a decision tree is quite different from what we did intuitively in the umbrella problem. We will begin with a small dataset of movies, as seen in figure 3 below. This is the data of a user, whom we will name Bob, listing the movies he has and hasn’t watched so far.
The headers “director”, “category”, and “imdb_rating” are what we call features or attributes. The field “watched” is what we call the class, which in this case is a binary “yes” or “no” (making this a binary classification problem).
We want to predict whether we should recommend “6 Underground”, directed by Michael Bay (the test sample on the last row), to Bob. In other words, we want to predict whether Bob would likely watch it.
When we build the tree for this movie problem, we start by splitting on one of director, category, or imdb_rating. But how do we know which attribute to split on first?
For the sake of learning, let’s decide to split on category first, and count the classes under each value of that attribute.
In Search of Purity
Notice that the Horror category gets 3 yes’s without a single no. We call this node pure. Because the algorithm now knows with 100% certainty that Bob watches horror movies, it stops splitting that branch. If Bob hadn’t watched a single horror movie, making all 3 of them no’s, the node would still be considered pure.
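In code, purity simply means that every sample in a node belongs to the same class. A minimal sketch (the helper name is mine, not from any library):

def is_pure(labels):
    # A node is pure when all of its samples carry the same class label.
    return len(set(labels)) <= 1

print(is_pure(["yes", "yes", "yes"]))  # True, like the Horror node
print(is_pure(["yes", "no", "yes"]))   # False, an impure node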
Without getting into the details of which node to split next, we continue to split the impure nodes. In this case, let’s split the Action node by director, as shown in figure 6, again counting the classes for each director. We find that both the Michael Bay and Taika Waititi nodes are pure.
Without splitting the tree any further, the algorithm can at this point predict that “6 Underground”, an action movie directed by Michael Bay, is a movie Bob would likely watch, as seen in figure 7.
Entropy and Information Gain
We have not yet answered the earlier question of how a decision tree algorithm chooses which attribute to split on out of director, category, and imdb_rating.
At a high level, a decision tree algorithm is very deliberate about selecting a node to split. It doesn’t pick one at random as we did previously. Instead, it chooses the split that yields the highest overall purity, or more precisely, the quickest increase in purity. How does it do that? Essentially, it decides which route to take by asking “How much information will this split gain?” or “How much more certain will this branch be compared to the others?”
We will first introduce entropy, a concept from information theory that is widely used in machine learning. In essence, entropy measures the amount of uncertainty, or lack of information. The higher the entropy, the lower the certainty and purity. For a binary class, entropy is measured in bits and ranges from 0 to 1, where 0 means no entropy (100% purity) and 1 means maximum entropy (complete impurity). Entropy can be calculated with the following formula:
Entropy = -P(yes) * log₂ P(yes) - P(no) * log₂ P(no)
where P(yes) stands for the probability of yes and P(no) for the probability of no. For example, to calculate the entropy of the Action node, we substitute P(yes) = 7/8 and P(no) = 1/8 (7 yes / 1 no). Thus we get:
Entropy = -7/8 * (-0.19) - 1/8 * (-3.0) ≈ 0.54
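The same calculation in Python, as a rough sketch (the entropy helper below is mine, not part of any library):

import math

def entropy(p_yes, p_no):
    # Entropy of a node given its class probabilities; 0 * log2(0) is treated as 0.
    total = 0.0
    for p in (p_yes, p_no):
        if p > 0:
            total -= p * math.log2(p)
    return total

print(entropy(7/8, 1/8))  # ~0.54, the Action node
print(entropy(3/3, 0/3))  # 0.0, the pure Horror node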
The entropies for Drama, Scifi, and Horror are 0.21, 0.19, and 0.0 respectively. Note that because Horror is pure, its entropy is 0. Now, why were we calculating the entropies of all these nodes? The secret sauce the decision tree algorithm uses to pick an attribute to split on is Information Gain: it chooses the attribute with the highest information gain. Information Gain can be calculated with the following formula:
Information Gain = entropy of parent - average entropy of children
To calculate the information gain for the category attribute, let’s first calculate the entropy of the parent. Because the dataset has 21 samples, with 14 yes’s and 7 no’s, we substitute P(yes) = 14/21 and P(no) = 7/21. Therefore, the parent’s entropy is:
Parent Entropy = -14/21 * (-0.6) - 7/21 * (-1.6) ≈ 0.93
In order to calculate the average entropy of all the children, whose entropies are:
- Action = 0.54
- Scifi = 0.19
- Drama = 0.21
- Horror = 0.0
We multiply each entropy by the size of that child, divided by the total size of the dataset. For example, Action appears in 8 out of 21 rows, so the weighted entropy of Action is 8/21 * 0.54 ≈ 0.21. Scifi appears in 4 out of 21 rows, therefore 4/21 * 0.19 ≈ 0.04. Drama appears in 5 out of 21 rows, 5/21 * 0.21 ≈ 0.05, and Horror contributes 0. We then sum these weighted values together to get the average entropy.
Average entropy of children = 0.21 + 0.04 + 0.05 + 0 = 0.30
Then subtract it from the parent’s entropy:
Information Gain = 0.93 - 0.30 = 0.63
We arrive at an information gain of 0.63 for the category attribute. The algorithm then calculates the information gain for the director and imdb_rating attributes as well, and selects the attribute with the highest information gain to split on.
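Here is a rough sketch of the same information gain calculation in Python. The parent counts, child entropies, and group sizes are the ones quoted above, and the helper function is mine, so treat the numbers as illustrative:

import math

def entropy(p_yes, p_no):
    # Entropy of a node given its class probabilities; 0 * log2(0) is treated as 0.
    total = 0.0
    for p in (p_yes, p_no):
        if p > 0:
            total -= p * math.log2(p)
    return total

# Parent node: 14 yes's and 7 no's out of 21 samples.
parent = entropy(14/21, 7/21)   # ~0.92

# (size, entropy) of each child of the category split, using the values above.
# Horror is pure (entropy 0), so it contributes nothing to the average.
children = [(8, 0.54), (4, 0.19), (5, 0.21)]
total_samples = 21

# Weighted average of the children's entropies.
avg_children = sum(size / total_samples * e for size, e in children)   # ~0.29

info_gain = parent - avg_children
print(round(info_gain, 2))   # ~0.63 (small differences from the hand calculation are just rounding)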
Decision Tree in Practice
With a basic understanding of the concept, we are now ready to write some code to build a recommendation engine. In practice, machine learning libraries such as scikit-learn already implement all the algorithms needed to build the tree.
First off, let’s import all necessary modules and load the CSV dataset as a dataframe.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from enum import Enum
df = pd.read_csv(
'https://gist.githubusercontent.com/jochasinga/a8e03d61df616728603d9863de2733c6/raw/0bb89604b54d4547849d401545a3dfcf551cdade/movies.csv',
header=0,
)
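A quick sanity check that the data loaded as expected (the columns are the ones described earlier):

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows: director, category, imdb_rating, watched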
Because some of the features and the label classes are categorical, we will encode them as integers instead of strings. We import LabelEncoder to encode the label (watched) and OrdinalEncoder to encode the features.
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
# Replace any '?' values in the dataframe with 'Yes' before encoding.
df.replace('?', 'Yes', inplace=True)
le = LabelEncoder()
df['watched'] = le.fit_transform(df['watched'])
oe = OrdinalEncoder(dtype=np.int32)
df[['director', 'category']] = oe.fit_transform(df[['director', 'category']])
Now the dataset should look similar to this:
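If you are curious how the strings map to integers, you can inspect the fitted encoders. The exact values depend on the CSV, so the outputs shown are only illustrative:

print(le.classes_)      # e.g. ['No' 'Yes'], encoded as 0 and 1
print(oe.categories_)   # the director and category values, in the order of their integer codes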
Now we can call train_test_split to split the dataset into training and test samples, create a DecisionTreeClassifier object, and call fit to train the model. Lastly, we print the accuracy of the model and call predict on the test samples to predict the outputs. Note that because of the small dataset, the accuracy won’t be very good.
all_inputs = df[['director', 'category', 'imdb_rating']].values
all_classes = df['watched'].values
(train_inputs, test_inputs, train_classes, test_classes) = train_test_split(
    all_inputs, all_classes, train_size=0.8, random_state=1)
classifier = DecisionTreeClassifier()
classifier.fit(train_inputs, train_classes)
print("Accuracy", classifier.score(test_inputs, test_classes))
classifier.predict(test_inputs)
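Since matplotlib is already imported, a nice way to sanity-check the result is to visualize the trained tree with scikit-learn’s plot_tree. The feature names are the columns used above; the class names come from the fitted LabelEncoder:

from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(
    classifier,
    feature_names=['director', 'category', 'imdb_rating'],
    class_names=[str(c) for c in le.classes_],
    filled=True,  # color each node by its majority class
)
plt.show()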
Conclusion
A decision tree is one of the most powerful machine learning algorithms, used extensively in finance and healthcare, among other industries… TBD