Decision Tree - Regression
A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing a value of the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
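
To make the node terminology concrete, here is a minimal sketch of how such a tree could be represented and used for prediction. The nested-dictionary layout, the `predict` helper, the `Windy` attribute and all leaf values are assumptions made for illustration; they are not taken from the article's dataset.

```python
# Hypothetical tree: attribute names, branch values and leaf numbers are
# illustrative only, not results from the article's dataset.
tree = {
    "attribute": "Outlook",            # decision node (here also the root)
    "branches": {
        "Sunny": {"value": 30.0},      # leaf node: numerical prediction
        "Overcast": {"value": 45.0},
        "Rainy": {
            "attribute": "Windy",      # nested decision node
            "branches": {
                "True": {"value": 25.0},
                "False": {"value": 40.0},
            },
        },
    },
}


def predict(node, instance):
    """Follow the branches matching the instance until a leaf is reached."""
    while "value" not in node:
        branch_value = instance[node["attribute"]]
        node = node["branches"][branch_value]
    return node["value"]


prediction = predict(tree, {"Outlook": "Rainy", "Windy": "False"})
print(prediction)
```

Each decision node tests one attribute and each leaf stores a single number, so prediction is just a walk from the root to a leaf.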

Decision Tree Algorithm
The core algorithm for building decision trees is called ID3, by J. R. Quinlan. It employs a top-down, greedy search through the space of possible branches with no backtracking. The ID3 algorithm can be used to construct a decision tree for regression by replacing Information Gain with Standard Deviation Reduction.
Standard Deviation
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar (homogeneous) values. We use standard deviation to calculate the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation is zero.
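
A quick way to see this is to compare a spread-out sample with a constant one. The sketch below assumes the population form of the standard deviation (Python's `statistics.pstdev`), and the sample values are hypothetical.

```python
from statistics import pstdev  # population standard deviation

# Hypothetical target values, chosen only for illustration.
mixed = [20, 30, 40, 50]
identical = [35, 35, 35, 35]

print(pstdev(mixed))      # spread-out values give a positive deviation
print(pstdev(identical))  # identical values give zero
```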
a) Standard deviation for one attribute:
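
Assuming the usual population definition (dividing by n rather than n - 1), the standard deviation of a target with values x_1, ..., x_n can be written as follows; the symbols are chosen here for illustration:

```latex
S = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```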

b) Standard deviation for two attributes:
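
Assuming the conventional weighted form, where T is the target, X is the attribute being tested, c ranges over the values of X, P(c) is the fraction of instances with value c, and S(c) is the standard deviation of the target within that branch (symbols chosen here for illustration):

```latex
S(T, X) = \sum_{c \in X} P(c)\, S(c)
```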

Standard Deviation Reduction
The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).
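
The reduction can be sketched in a few lines of Python. This is a minimal illustration, assuming the population standard deviation and a branch-size-weighted average after the split; the function name `sd_reduction` and the rows are hypothetical.

```python
from collections import defaultdict
from statistics import pstdev


def sd_reduction(rows, attribute, target):
    """Standard deviation of the target before the split minus the
    weighted standard deviation of the branches after the split."""
    before = pstdev([row[target] for row in rows])
    branches = defaultdict(list)
    for row in rows:
        branches[row[attribute]].append(row[target])
    after = sum(len(vals) / len(rows) * pstdev(vals) for vals in branches.values())
    return before - after


# Hypothetical rows, for illustration only.
rows = [
    {"Outlook": "Sunny", "Hours Played": 20},
    {"Outlook": "Sunny", "Hours Played": 30},
    {"Outlook": "Rainy", "Hours Played": 50},
    {"Outlook": "Rainy", "Hours Played": 60},
]
reduction = sd_reduction(rows, "Outlook", "Hours Played")
print(reduction)
```

A larger reduction means the branches are more homogeneous than the unsplit data.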
Step 1: The standard deviation of the target is calculated.
Standard deviation (Hours Played) = 9.32
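Step 2: The dataset is then split on the different attributes. The standard deviation for each branch is calculated, and the branch values are combined into a single standard deviation for the split, weighted by the share of instances in each branch. This resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard deviation reduction.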


Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.
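
Steps 2 and 3 together amount to scoring every candidate attribute and keeping the best one. The sketch below assumes the population standard deviation; `best_attribute`, the `Windy` attribute and the rows are hypothetical and only illustrate the selection.

```python
from collections import defaultdict
from statistics import pstdev


def sd_reduction(rows, attribute, target):
    """Standard deviation before the split minus the weighted one after it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    weighted = sum(len(g) / len(rows) * pstdev(g) for g in groups.values())
    return pstdev([row[target] for row in rows]) - weighted


def best_attribute(rows, attributes, target):
    """Pick the attribute with the largest standard deviation reduction."""
    return max(attributes, key=lambda a: sd_reduction(rows, a, target))


# Hypothetical rows, for illustration only.
rows = [
    {"Outlook": "Sunny", "Windy": "No", "Hours Played": 20},
    {"Outlook": "Sunny", "Windy": "Yes", "Hours Played": 30},
    {"Outlook": "Rainy", "Windy": "No", "Hours Played": 50},
    {"Outlook": "Rainy", "Windy": "Yes", "Hours Played": 60},
]
chosen = best_attribute(rows, ["Outlook", "Windy"], "Hours Played")
print(chosen)
```

In this toy data, splitting on Outlook separates low and high targets far better than splitting on Windy, so Outlook is selected.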

Step 4a: The dataset is divided based on the values of the selected attribute.

Step 4b: A branch set with a standard deviation greater than 0 needs further splitting.
In practice, we need some termination criteria. For example, splitting can stop when the standard deviation for the branch becomes smaller than a certain fraction (e.g., 5%) of the standard deviation for the full dataset, or when too few instances remain in the branch (e.g., 3).
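
The two criteria can be combined into a single check, sketched below. The function name `should_stop`, its parameter names, and the reading of "too few" as "3 or fewer" are assumptions made here; the 5% and 3 defaults mirror the examples in the text.

```python
from statistics import pstdev


def should_stop(branch_targets, full_sd, sd_fraction=0.05, min_instances=3):
    """Return True when the branch should become a leaf."""
    # Assumption: "too few instances" is read as min_instances or fewer.
    if len(branch_targets) <= min_instances:
        return True
    # Branch is homogeneous enough relative to the full dataset.
    return pstdev(branch_targets) < sd_fraction * full_sd


# Hypothetical target values, for illustration only.
full_sd = pstdev([20, 30, 40, 45, 50, 60])

print(should_stop([40, 40, 40, 41], full_sd))  # nearly constant branch
print(should_stop([20, 30, 50, 60], full_sd))  # still widely spread
print(should_stop([20, 60], full_sd))          # too few instances
```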

Step 5: The process is run recursively on the non-leaf branches until all data is processed.
When more than one instance remains at a leaf node, we calculate the average of their target values as the final value for the target.
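
Putting the steps together, the recursion might look like the following sketch. It is not a reference implementation: it assumes categorical attributes, the population standard deviation, removal of an attribute once it has been used, and the stopping thresholds from the examples above. All function names, the `Windy` attribute and the rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def split(rows, attribute):
    """Group rows by their value for the given attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row)
    return groups


def sd_reduction(rows, attribute, target):
    """Standard deviation before the split minus the weighted one after it."""
    before = pstdev([r[target] for r in rows])
    after = sum(
        len(g) / len(rows) * pstdev([r[target] for r in g])
        for g in split(rows, attribute).values()
    )
    return before - after


def build(rows, attributes, target, full_sd, sd_fraction=0.05, min_instances=3):
    """Recursively build a regression tree as nested dictionaries."""
    targets = [r[target] for r in rows]
    # Leaf: stop on the termination criteria or when no attributes remain,
    # and predict the average of the remaining target values.
    if (
        not attributes
        or len(rows) <= min_instances
        or pstdev(targets) < sd_fraction * full_sd
    ):
        return {"value": mean(targets)}
    best = max(attributes, key=lambda a: sd_reduction(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {
        "attribute": best,
        "branches": {
            value: build(group, remaining, target, full_sd, sd_fraction, min_instances)
            for value, group in split(rows, best).items()
        },
    }


# Hypothetical rows, for illustration only.
rows = [
    {"Outlook": "Sunny", "Windy": "No", "Hours Played": 20},
    {"Outlook": "Sunny", "Windy": "No", "Hours Played": 22},
    {"Outlook": "Sunny", "Windy": "Yes", "Hours Played": 30},
    {"Outlook": "Sunny", "Windy": "Yes", "Hours Played": 32},
    {"Outlook": "Rainy", "Windy": "No", "Hours Played": 50},
    {"Outlook": "Rainy", "Windy": "No", "Hours Played": 52},
    {"Outlook": "Rainy", "Windy": "Yes", "Hours Played": 60},
    {"Outlook": "Rainy", "Windy": "Yes", "Hours Played": 62},
]
full_sd = pstdev([r["Hours Played"] for r in rows])
tree = build(rows, ["Outlook", "Windy"], "Hours Played", full_sd)
print(tree)
```

With this toy data the root splits on Outlook, each Outlook branch splits on Windy, and every leaf holds the average of the two instances that reach it.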
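This article describes how to perform decision tree regression analysis with MATLAB, including the ID3 algorithm and the concept of standard deviation reduction. A decision tree builds a regression or classification model by partitioning the dataset; decision nodes represent attribute tests and leaf nodes represent the numerical target. Standard deviation reduction is used to assess the purity of the dataset after a split, and the tree is built recursively until a termination criterion is met, such as the standard deviation falling below a given fraction or too few instances remaining.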