Examples of Implementing the C4.5 Decision Tree Algorithm in Python
What Is the C4.5 Decision Tree Algorithm?
The C4.5 decision tree algorithm is a widely used classification algorithm. Its basic idea is to partition a data set step by step, building a tree structure that classifies the data. C4.5 is an improved version of ID3: it introduces the information gain ratio as the splitting criterion, which fixes ID3's bias toward features with many distinct values.
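For reference, the standard quantities behind this criterion, for a data set D split on a feature a whose value v selects the subset D_v, are:

Ent(D) = -Σ_k p_k · log2(p_k)
Gain(D, a) = Ent(D) - Σ_v (|D_v| / |D|) · Ent(D_v)
SplitInfo(D, a) = -Σ_v (|D_v| / |D|) · log2(|D_v| / |D|)
GainRatio(D, a) = Gain(D, a) / SplitInfo(D, a)

Here p_k is the proportion of samples belonging to class k. ID3 maximizes Gain, while C4.5 maximizes GainRatio, whose SplitInfo denominator penalizes features that shatter the data into many small subsets.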
Implementation Steps of the C4.5 Algorithm
The C4.5 decision tree algorithm is implemented in the following steps:
1. Select the best feature: choose the feature with the highest gain ratio as the splitting feature for the current node.
2. Split the data set: partition the data set into subsets according to the values of the chosen feature.
3. Build the tree recursively: repeat steps 1 and 2 on each subset until every subset is of a single class or can no longer be split.
4. Prune the tree: prune the fully grown tree to improve its generalization ability.
Example 1: Classification with the C4.5 Decision Tree Algorithm
The following example demonstrates how to classify data with the C4.5 algorithm. Suppose we have a data set of 5 samples; each sample has two features, age and income, plus a class label indicating which class the sample belongs to:
Age | Income | Class |
---|---|---|
20 | 2000 | 0 |
25 | 2500 | 0 |
30 | 3000 | 1 |
35 | 3500 | 1 |
40 | 4000 | 1 |
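As a quick sanity check, the data set contains 2 samples of class 0 and 3 of class 1, so its entropy is Ent(D) = -(2/5)·log2(2/5) - (3/5)·log2(3/5) ≈ 0.971 bits; the calc_entropy function below should reproduce this value.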
We now use the C4.5 algorithm to classify this data set.
import math

def calc_entropy(data_set):
    # Shannon entropy of the class labels (the last column of each row).
    num_entries = len(data_set)
    label_counts = {}
    for feat_vec in data_set:
        current_label = feat_vec[-1]
        if current_label not in label_counts:
            label_counts[current_label] = 0
        label_counts[current_label] += 1
    entropy = 0.0
    for key in label_counts:
        prob = float(label_counts[key]) / num_entries
        entropy -= prob * math.log(prob, 2)
    return entropy
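# Example: calc_entropy([[0], [0], [1], [1], [1]]) ≈ 0.971,
# since only the last column (the class label) of each row is read.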
def split_data_set(data_set, axis, value):
    # Return the rows whose feature `axis` equals `value`,
    # with that feature column removed.
    ret_data_set = []
    for feat_vec in data_set:
        if feat_vec[axis] == value:
            reduced_feat_vec = feat_vec[:axis]
            reduced_feat_vec.extend(feat_vec[axis+1:])
            ret_data_set.append(reduced_feat_vec)
    return ret_data_set
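# Example: split_data_set([[20, 2000, 0], [25, 2500, 0]], 0, 20)
# returns [[2000, 0]]: the matching row with the age column dropped.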
def choose_best_feature_to_split(data_set):
    # Pick the feature with the highest information gain ratio.
    num_features = len(data_set[0]) - 1
    base_entropy = calc_entropy(data_set)
    best_info_gain_ratio = 0.0
    best_feature = -1
    for i in range(num_features):
        feat_list = [example[i] for example in data_set]
        unique_vals = set(feat_list)
        new_entropy = 0.0
        split_info = 0.0
        for value in unique_vals:
            sub_data_set = split_data_set(data_set, i, value)
            prob = len(sub_data_set) / float(len(data_set))
            new_entropy += prob * calc_entropy(sub_data_set)
            split_info -= prob * math.log(prob, 2)
        info_gain = base_entropy - new_entropy
        if split_info == 0.0:  # feature takes a single value; skip it
            continue
        info_gain_ratio = info_gain / split_info
        if info_gain_ratio > best_info_gain_ratio:
            best_info_gain_ratio = info_gain_ratio
            best_feature = i
    return best_feature
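# On the 5-sample data set above, both features happen to tie
# (gain ratio ≈ 0.971 / log2(5) ≈ 0.418), so the first feature, age, wins.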
def majority_cnt(class_list):
    # Return the most frequent class label (used when no features remain).
    class_count = {}
    for vote in class_list:
        if vote not in class_count:
            class_count[vote] = 0
        class_count[vote] += 1
    sorted_class_count = sorted(class_count.items(), key=lambda x: x[1], reverse=True)
    return sorted_class_count[0][0]
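# Example: majority_cnt([0, 1, 1]) returns 1.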
def create_tree(data_set, labels):
    # Recursively build the tree as nested dicts:
    # {feature_label: {feature_value: subtree_or_class_label}}.
    class_list = [example[-1] for example in data_set]
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]  # all samples share one class: leaf node
    if len(data_set[0]) == 1:
        return majority_cnt(class_list)  # no features left: majority vote
    best_feat = choose_best_feature_to_split(data_set)
    best_feat_label = labels[best_feat]
    my_tree = {best_feat_label: {}}
    labels = labels[:]  # work on a copy so the caller's label list is not mutated
    del labels[best_feat]
    feat_values = [example[best_feat] for example in data_set]
    unique_vals = set(feat_values)
    for value in unique_vals:
        sub_labels = labels[:]
        my_tree[best_feat_label][value] = create_tree(split_data_set(data_set, best_feat, value), sub_labels)
    return my_tree
data_set = [[20, 2000, 0], [25, 2500, 0], [30, 3000, 1], [35, 3500, 1], [40, 4000, 1]]
labels = ['age', 'income']
my_tree = create_tree(data_set, labels)
print(my_tree)
In this example we define five functions, calc_entropy, split_data_set, choose_best_feature_to_split, majority_cnt, and create_tree, which together implement the C4.5 algorithm. We then call create_tree with data_set and labels to build the decision tree and print the result.
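Tracing this run by hand: both features tie on gain ratio (≈ 0.418), the tree therefore splits on age, and every age value identifies exactly one sample, so the program should print {'age': {20: 0, 25: 0, 30: 1, 35: 1, 40: 1}} (key order may vary). The code above builds a tree but never uses it to predict. As a minimal sketch, a classifier that walks the nested dicts could look like the following (classify is a hypothetical helper, not part of the original code):

def classify(tree, labels, sample):
    # Walk the nested-dict tree until a leaf (a plain class label) is reached.
    if not isinstance(tree, dict):
        return tree
    feat_label = next(iter(tree))          # feature tested at this node
    feat_index = labels.index(feat_label)  # its position in the sample
    subtree = tree[feat_label].get(sample[feat_index])
    if subtree is None:
        return None  # feature value never seen during training
    return classify(subtree, labels, sample)

print(classify(my_tree, ['age', 'income'], [30, 3000]))  # expected: 1

Note that this only works for feature values seen during training, since the implementation above treats age and income as categorical rather than thresholding them.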
Example 2: Classification on a Categorical Data Set
Here is another example of classification with the C4.5 algorithm. This time the data set contains 8 samples; each sample has three features, color, root and sound, plus a class label:
Color | Root | Sound | Class |
---|---|---|---|
green | curled | dull | 0 |
dark | curled | muffled | 0 |
dark | slightly curled | dull | 1 |
green | slightly curled | dull | 1 |
light | curled | dull | 0 |
dark | slightly curled | muffled | 1 |
green | stiff | dull | 0 |
light | slightly curled | dull | 1 |
We now use the C4.5 algorithm to classify this data set.
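It is worth tracing the first split by hand. The root feature separates the classes perfectly (all curled samples are class 0, all slightly curled ones are class 1, and the single stiff sample is class 0), so its information gain equals the full entropy of 1.0 bit; its split information is (3/8)·log2(8/3) + (4/8)·log2(2) + (1/8)·log2(8) ≈ 1.406, giving a gain ratio of about 0.711. Color scores only about 0.039 and sound 0, so the algorithm should split on root first.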
The five functions (calc_entropy, split_data_set, choose_best_feature_to_split, majority_cnt and create_tree) are identical to the ones defined in Example 1, so we reuse them here; only the data set and the feature labels change:
data_set = [['green', 'curled', 'dull', 0], ['dark', 'curled', 'muffled', 0], ['dark', 'slightly curled', 'dull', 1], ['green', 'slightly curled', 'dull', 1], ['light', 'curled', 'dull', 0], ['dark', 'slightly curled', 'muffled', 1], ['green', 'stiff', 'dull', 0], ['light', 'slightly curled', 'dull', 1]]
labels = ['color', 'root', 'sound']
my_tree = create_tree(data_set, labels)
print(my_tree)
Calling create_tree with the new data_set and labels builds the decision tree for this categorical data set and prints the result.
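Because root alone separates the classes perfectly, the tree needs only a single split, and the program should print something like {'root': {'curled': 0, 'slightly curled': 1, 'stiff': 0}} (key order may vary).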
Conclusion
This article showed how to implement the C4.5 decision tree algorithm in Python, with two worked examples. Note that this implementation treats every feature as categorical (Example 1 splits on exact age and income values) and stops at tree construction; in real applications you would threshold continuous features, apply the pruning step from the outline above, and choose among algorithm variants according to the problem at hand.