Machine Learning and Data Mining
ST443: Machine Learning and Data Mining
Lecture 7: Trees and forests
Key learning points
- Decision trees
- How to prune a tree
- Gradient boosting
- Bootstrap aggregation
- Random forest
Tree-based methods
- We focus on tree-based methods for regression and classification.
- These involve stratifying or segmenting the predictor space into a number of simple regions.
- Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.
Baseball player salary data
- Salary is colored from low (blue, green) to high (yellow, red).
- How would you stratify it?
Decision tree for the data
R1 = {Years < 4.5}, R2 = {Years ≥ 4.5 and Hits < 117.5} and R3 = {Years ≥ 4.5 and Hits ≥ 117.5}.
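As a minimal sketch (not from the slides; the function name and region labels are illustrative only), the partition above can be written as a simple lookup rule:

```python
def hitters_region(years, hits):
    """Return the leaf region of the tree above for a given player."""
    if years < 4.5:
        return "R1"
    # Years >= 4.5: split further on Hits at 117.5
    return "R2" if hits < 117.5 else "R3"
```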
Details for the decision tree
- In keeping with the tree analogy, the regions R1, R2 and R3 are known as terminal nodes or leaves of the tree.
- Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree.
- The points along the tree where the predictor space is split are referred to as internal nodes.
- In the Hitters tree, the two internal nodes are indicated by the text Years < 4.5 and Hits < 117.5.
- We refer to the segments of the trees that connect the nodes as branches.
Interpretation of the decision tree results
- Years is the most important factor in determining Salary: players with less experience earn lower salaries than more experienced players.
- Given that a player is less experienced, the number of Hits that he made in the previous year seems to play little role in his Salary.
- But among players who have been in the major leagues for five or more years, the number of Hits made in the previous year does affect Salary, and players who made more Hits last year tend to have higher salaries.
- The figure is likely an over-simplification but, compared to a regression model, the tree is easy to display and interpret.
Details of the tree-building process
- We divide the predictor/covariate space, i.e. the set of possible values for X1, . . . , Xp, into J distinct and non-overlapping regions R1, R2, . . . , RJ.
- For every observation that falls into the region Rj, we make the same prediction, which is simply the mean of the response values for the training observations in Rj.
- In theory, the regions could have any shape. Decision trees divide the predictor space into axis-aligned high-dimensional rectangles for simplicity and interpretability.
- Our target is to find rectangles \(R_1, \dots, R_J\) that minimize the RSS
  \[
  \sum_{j=1}^{J} \sum_{i \colon x_i \in R_j} (y_i - \hat{y}_{R_j})^2,
  \]
  where \(\hat{y}_{R_j} = N_j^{-1} \sum_{i \colon x_i \in R_j} y_i\) and \(N_j = \#\{i : x_i \in R_j\}\).
More details on the tree-building process
- It is computationally infeasible to consider every possible partition of the feature space into J rectangles.
- We take a top-down, greedy approach known as recursive binary splitting.
- Top-down: it begins at the top of the tree and then successively splits the predictor space; each split is indicated via two new branches further down the tree.
- Greedy: at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
More details on the tree-building process
- We first select the predictor Xj and cutpoint s such that splitting the predictor space into regions {X : Xj < s} and {X : Xj ≥ s} leads to the greatest possible reduction in RSS.
- Then we repeat the process, searching for the best predictor and cutpoint to further split the data so as to minimize the RSS within each of the resulting regions. Note that rather than splitting the entire space, we split one of the two previously identified regions. Now we have three regions.
- Again, we look to split one of these three regions further, so as to minimize the RSS. The process continues until a stopping criterion is reached, e.g. we may continue until no region contains more than five observations. A code sketch of this procedure is given after this list.
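The following is a minimal sketch of recursive binary splitting for a regression tree, not the implementation used in the lecture; the function names, the dictionary-based tree representation and the min_obs stopping rule are my own choices.

```python
import numpy as np

def best_split(X, y):
    """Greedy step: search over all predictors j and cutpoints s for the
    binary split {X_j < s} vs {X_j >= s} giving the greatest reduction in RSS."""
    best = None  # (rss, j, s)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or rss < best[0]:
                best = (rss, j, s)
    return best

def grow_tree(X, y, min_obs=5):
    """Recursive binary splitting; stop once a region contains at most
    min_obs observations (or no further split is possible)."""
    split = best_split(X, y) if len(y) > min_obs else None
    if split is None:
        return {"prediction": y.mean()}  # leaf: predict the region mean
    _, j, s = split
    left = X[:, j] < s
    return {"j": j, "s": s,
            "left": grow_tree(X[left], y[left], min_obs),
            "right": grow_tree(X[~left], y[~left], min_obs)}
```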
Splitting the predictor space
Generally we create the partitions by iteratively splitting one of the X variables into two regions. For instance:
1. First split on X1 = t1.
2. If X1 < t1, split on X2 = t2.
3. If X1 > t1, split on X1 = t3.
4. If X1 > t3, split on X2 = t4.
This sequence of four splits partitions the predictor space into five rectangles, as sketched below.
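As an illustrative sketch only (the region labels and the function below are hypothetical, not from the slides), the five resulting rectangles can be read off as a nested lookup:

```python
def region(x1, x2, t1, t2, t3, t4):
    """Map a point (x1, x2) to one of the five rectangles produced by the
    four splits listed above (region labels are illustrative)."""
    if x1 < t1:
        return "R1" if x2 < t2 else "R2"   # splits 1 and 2
    if x1 < t3:
        return "R3"                        # splits 1 and 3
    return "R4" if x2 < t4 else "R5"       # splits 3 and 4
```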
Estimator for regression trees
- Consider the statistical learning model
  \[
  Y_i = f(X_i) + \varepsilon_i, \qquad i = 1, \dots, n,
  \]
  where \(X_i = (x_{i1}, \dots, x_{ip})^\top\).
- A regression tree T with leaf regions \(R_1, \dots, R_J\) corresponds to the estimator \(\hat{f}_T\) given, for \(x \in \mathbb{R}^p\), by
  \[
  \hat{f}_T(x) := \sum_{j=1}^{J} \bar{Y}_{R_j} 1\{x \in R_j\},
  \]
  where \(\bar{Y}_{R_j} := N_j^{-1} \sum_{i \colon X_i \in R_j} Y_i\) and \(N_j = \sum_{i=1}^{n} 1\{X_i \in R_j\}\).
- The training RSS of the regression tree T is
  \[
  \mathrm{RSS}(T) := \sum_{j=1}^{J} \sum_{i \colon X_i \in R_j} (Y_i - \bar{Y}_{R_j})^2.
  \]
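Continuing the earlier sketch (same hypothetical dictionary representation as grow_tree above), evaluating \(\hat{f}_T\) at a point amounts to following the splits down to a leaf and returning that leaf's training mean:

```python
def predict(tree, x):
    """Evaluate f_hat_T(x): descend the tree until a leaf is reached and
    return the leaf's training mean Y_bar_{R_j}."""
    while "prediction" not in tree:
        tree = tree["left"] if x[tree["j"]] < tree["s"] else tree["right"]
    return tree["prediction"]
```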
Pruning a tree
- The tree-building process may produce good predictions on the training set, but it is likely to overfit the data, leading to poor test set performance.
- We can first grow a very big tree T0 and then prune it back to obtain a subtree.
- In cost complexity pruning, we consider a sequence of trees indexed by a tuning parameter \(\alpha > 0\). For each \(\alpha\), we seek a subtree \(T_\alpha \subset T_0\) such that
  \[
  T_\alpha = \operatorname*{argmin}_{T \subset T_0} \{ \mathrm{RSS}(T) + \alpha |T| \},
  \]
  where \(|T|\) denotes the number of leaves of T.
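In practice, cost complexity pruning is available off the shelf; the sketch below uses scikit-learn's DecisionTreeRegressor (the toy data and the choice to print the number of leaves are my own, not from the lecture).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data standing in for the Hitters example.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] < 0.5, 5.0, 7.0) + rng.normal(scale=0.5, size=200)

# Compute the sequence of effective alphas for cost-complexity pruning of a
# fully grown tree T0; each alpha corresponds to one subtree T_alpha.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Refit at a few alphas: larger alpha penalizes leaves more, giving smaller subtrees.
for alpha in path.ccp_alphas[:: max(1, len(path.ccp_alphas) // 5)]:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha = {alpha:.4f}, leaves = {pruned.get_n_leaves()}")
```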