Keywords: machine learning, ID3 algorithm, decision tree, information gain, entropy, manual calculation, play-tennis weather dataset, Python ID3, Java ID3, Iterative Dichotomiser 3
One-sentence answer: ID3 is one of the earliest decision tree algorithms. It selects the best splitting feature by information gain and builds the tree recursively, laying the groundwork for successors such as C4.5 and CART!
If you have been searching for any of the keywords above, this article is for you: from entropy all the way to the finished tree, with no steps skipped.
ID3 (Iterative Dichotomiser 3), proposed by Ross Quinlan in 1986, was the first decision tree algorithm grounded in information theory.
Given a labeled tabular dataset, it automatically builds a tree of if-else rules that can classify new samples.
💡 Although ID3 has been superseded by C4.5 and CART, it remains the best starting point for understanding how decision trees work.



✅ At each node, ID3 splits on the feature with the largest information gain
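Information gain is built on entropy, which measures how impure a set of labels is. Before diving into the dataset, a minimal stdlib sketch of entropy alone (the helper name `label_entropy` is ours, not part of the full implementation later in the article):

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    ent = 0.0
    for count in Counter(labels).values():
        p = count / total
        ent -= p * math.log2(p)
    return ent

print(label_entropy(["Yes", "Yes", "No", "No"]))    # 1.0: a 50/50 split, maximum uncertainty
print(label_entropy(["Yes", "Yes", "Yes", "Yes"]))  # 0.0: a pure set, nothing left to learn
print(label_entropy(["Yes"] * 9 + ["No"] * 5))      # ≈ 0.940: the 9-yes / 5-no dataset below
```

The two extremes bracket every split ID3 considers: a good split pushes each child subset toward 0 bits.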
| # | Outlook | Temperature | Humidity | Wind | Play? |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |
Goal: build an ID3 decision tree and use it to predict whether to play under new weather conditions.
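The gains reported in the table below follow from the standard entropy and information-gain formulas; written out for the root split:

$$H(S) = -\sum_{k} p_k \log_2 p_k, \qquad \mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

The full set has 9 "Yes" and 5 "No":

$$H(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$

Splitting on Outlook gives Sunny (2 Yes, 3 No, $H \approx 0.971$), Overcast (4 Yes, 0 No, $H = 0$), and Rain (3 Yes, 2 No, $H \approx 0.971$):

$$\mathrm{Gain}(S, \text{Outlook}) = 0.940 - \left(\tfrac{5}{14}\cdot 0.971 + \tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.971\right) \approx 0.247$$

Repeating the same calculation for the other three features yields the table below.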





| Feature | Information gain |
|---|---|
| Outlook | 0.247 ← largest |
| Humidity | 0.151 |
| Wind | 0.048 |
| Temperature | 0.029 |
✅ Root node = Outlook
→ Sunny branch: not pure, recurse → create node "Humidity"
→ Overcast branch: all samples are "Yes" → leaf node: Yes
→ Rain branch: not pure, recurse → create node "Wind"
```
               Outlook?
             /    |    \
        Sunny  Overcast  Rain
          /       |        \
    Humidity?    Yes      Wind?
      /    \             /    \
   High   Normal      Weak   Strong
     |      |           |      |
    No     Yes         Yes     No
```
✅ Perfectly classifies the training set (but may overfit!)
Python implementation:

```python
import math
from collections import Counter

def entropy(data):
    """Shannon entropy (in bits) of the class labels in the last column."""
    labels = [row[-1] for row in data]
    label_counts = Counter(labels)
    total = len(labels)
    ent = 0.0
    for count in label_counts.values():
        p = count / total
        ent -= p * math.log2(p)
    return ent

def split_data(data, feature_idx, value):
    """Rows whose feature at feature_idx equals value."""
    return [row for row in data if row[feature_idx] == value]

def information_gain(data, feature_idx):
    base_entropy = entropy(data)
    feature_values = set(row[feature_idx] for row in data)
    weighted_entropy = 0.0
    for val in feature_values:
        subset = split_data(data, feature_idx, val)
        prob = len(subset) / len(data)
        weighted_entropy += prob * entropy(subset)
    return base_entropy - weighted_entropy

def id3(data, features, feature_names):
    labels = [row[-1] for row in data]
    if len(set(labels)) == 1:
        return labels[0]  # pure node
    if not features:
        return Counter(labels).most_common(1)[0][0]  # majority vote
    # Pick the feature with the largest information gain
    best_feature_idx = max(features, key=lambda i: information_gain(data, i))
    best_feature_name = feature_names[best_feature_idx]
    tree = {best_feature_name: {}}
    feature_values = set(row[best_feature_idx] for row in data)
    remaining_features = [i for i in features if i != best_feature_idx]
    for val in feature_values:
        subset = split_data(data, best_feature_idx, val)
        if not subset:
            tree[best_feature_name][val] = Counter(labels).most_common(1)[0][0]
        else:
            tree[best_feature_name][val] = id3(subset, remaining_features, feature_names)
    return tree

# Prepare the dataset
data = [
    ["Sunny", "Hot", "High", "Weak", "No"],
    ["Sunny", "Hot", "High", "Strong", "No"],
    ["Overcast", "Hot", "High", "Weak", "Yes"],
    ["Rain", "Mild", "High", "Weak", "Yes"],
    ["Rain", "Cool", "Normal", "Weak", "Yes"],
    ["Rain", "Cool", "Normal", "Strong", "No"],
    ["Overcast", "Cool", "Normal", "Strong", "Yes"],
    ["Sunny", "Mild", "High", "Weak", "No"],
    ["Sunny", "Cool", "Normal", "Weak", "Yes"],
    ["Rain", "Mild", "Normal", "Weak", "Yes"],
    ["Sunny", "Mild", "Normal", "Strong", "Yes"],
    ["Overcast", "Mild", "High", "Strong", "Yes"],
    ["Overcast", "Hot", "Normal", "Weak", "Yes"],
    ["Rain", "Mild", "High", "Strong", "No"]
]
features = list(range(len(data[0]) - 1))  # [0, 1, 2, 3]
feature_names = ["Outlook", "Temperature", "Humidity", "Wind"]
tree = id3(data, features, feature_names)
print("ID3 decision tree:", tree)
```

Java implementation:

```java
import java.util.*;

public class ID3 {
    static class TreeNode {
        String feature;
        Map<String, TreeNode> children = new HashMap<>();
        String label; // set when this node is a leaf
        TreeNode(String feature) { this.feature = feature; }
        TreeNode(String feature, String label) { this.feature = feature; this.label = label; }
    }

    public static double entropy(List<String[]> data) {
        Map<String, Integer> count = new HashMap<>();
        for (String[] row : data) {
            String label = row[row.length - 1];
            count.put(label, count.getOrDefault(label, 0) + 1);
        }
        double ent = 0.0;
        int total = data.size();
        for (int c : count.values()) {
            double p = (double) c / total;
            ent -= p * (Math.log(p) / Math.log(2));
        }
        return ent;
    }

    public static List<String[]> splitData(List<String[]> data, int featureIdx, String value) {
        List<String[]> subset = new ArrayList<>();
        for (String[] row : data) {
            if (row[featureIdx].equals(value)) {
                subset.add(row);
            }
        }
        return subset;
    }

    public static double informationGain(List<String[]> data, int featureIdx) {
        double baseEnt = entropy(data);
        Set<String> values = new HashSet<>();
        for (String[] row : data) values.add(row[featureIdx]);
        double weightedEnt = 0.0;
        for (String val : values) {
            List<String[]> subset = splitData(data, featureIdx, val);
            double prob = (double) subset.size() / data.size();
            weightedEnt += prob * entropy(subset);
        }
        return baseEnt - weightedEnt;
    }

    public static String majorityLabel(List<String[]> data) {
        Map<String, Integer> count = new HashMap<>();
        for (String[] row : data) {
            String label = row[row.length - 1];
            count.put(label, count.getOrDefault(label, 0) + 1);
        }
        return Collections.max(count.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static TreeNode id3(List<String[]> data, List<Integer> features, String[] featureNames) {
        // Stop if the node is pure
        Set<String> labels = new HashSet<>();
        for (String[] row : data) labels.add(row[row.length - 1]);
        if (labels.size() == 1) {
            return new TreeNode("leaf", labels.iterator().next());
        }
        if (features.isEmpty()) {
            return new TreeNode("leaf", majorityLabel(data));
        }
        // Pick the feature with the largest information gain
        int bestIdx = -1;
        double bestGain = -1;
        for (int idx : features) {
            double gain = informationGain(data, idx);
            if (gain > bestGain) {
                bestGain = gain;
                bestIdx = idx;
            }
        }
        TreeNode node = new TreeNode(featureNames[bestIdx]);
        Set<String> values = new HashSet<>();
        for (String[] row : data) values.add(row[bestIdx]);
        List<Integer> remainingFeatures = new ArrayList<>(features);
        remainingFeatures.remove(Integer.valueOf(bestIdx));
        for (String val : values) {
            List<String[]> subset = splitData(data, bestIdx, val);
            if (subset.isEmpty()) {
                node.children.put(val, new TreeNode("leaf", majorityLabel(data)));
            } else {
                node.children.put(val, id3(subset, remainingFeatures, featureNames));
            }
        }
        return node;
    }

    // Demo
    public static void main(String[] args) {
        List<String[]> data = Arrays.asList(
            new String[]{"Sunny", "Hot", "High", "Weak", "No"},
            new String[]{"Sunny", "Hot", "High", "Strong", "No"},
            new String[]{"Overcast", "Hot", "High", "Weak", "Yes"},
            new String[]{"Rain", "Mild", "High", "Weak", "Yes"},
            new String[]{"Rain", "Cool", "Normal", "Weak", "Yes"},
            new String[]{"Rain", "Cool", "Normal", "Strong", "No"},
            new String[]{"Overcast", "Cool", "Normal", "Strong", "Yes"},
            new String[]{"Sunny", "Mild", "High", "Weak", "No"},
            new String[]{"Sunny", "Cool", "Normal", "Weak", "Yes"},
            new String[]{"Rain", "Mild", "Normal", "Weak", "Yes"},
            new String[]{"Sunny", "Mild", "Normal", "Strong", "Yes"},
            new String[]{"Overcast", "Mild", "High", "Strong", "Yes"},
            new String[]{"Overcast", "Hot", "Normal", "Weak", "Yes"},
            new String[]{"Rain", "Mild", "High", "Strong", "No"}
        );
        List<Integer> features = Arrays.asList(0, 1, 2, 3);
        String[] featureNames = {"Outlook", "Temperature", "Humidity", "Wind"};
        TreeNode tree = id3(data, features, featureNames);
        System.out.println("ID3 decision tree built");
        // Printing the full structure needs an extra traversal function, omitted here
    }
}
```

| Pros | Cons |
|---|---|
| ✅ Highly interpretable model (if-else rules) | ❌ Prone to overfitting (no pruning) |
| ✅ No feature scaling needed | ❌ Information gain is biased toward many-valued features |
| ✅ Fast to train | ❌ Cannot handle continuous features (vanilla ID3) |
| ✅ Naturally supports multi-class problems | ❌ Sensitive to noise |
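The continuous-feature limitation is what C4.5 fixes by turning a numeric column into a binary test at the best threshold. A minimal sketch of that idea, assuming midpoints between sorted distinct values as candidate thresholds (the helper name `best_threshold` is illustrative, not from the article's code):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    ent = 0.0
    for count in Counter(labels).values():
        p = count / total
        ent -= p * math.log2(p)
    return ent

def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return the (threshold, gain)
    pair with the largest information gain for the split value <= threshold."""
    pairs = list(zip(values, labels))
    base = entropy(labels)
    xs = sorted(set(values))
    best = (None, -1.0)
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = (base
                - (len(left) / len(pairs)) * entropy(left)
                - (len(right) / len(pairs)) * entropy(right))
        if gain > best[1]:
            best = (t, gain)
    return best

t, g = best_threshold([1, 2, 3, 4], ["No", "No", "Yes", "Yes"])
print(t, g)  # 2.5 1.0: a clean split at the midpoint
```

Each candidate threshold is just another binary feature, so the rest of the ID3 machinery applies unchanged; that is essentially how C4.5 and CART admit continuous inputs.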
| Algorithm | Feature types | Split criterion | Pruning | Output |
|---|---|---|---|---|
| ID3 | Discrete | Information gain | ❌ None | Classification tree |
| C4.5 | Discrete + continuous | Gain ratio | ✅ Yes | Classification tree |
| CART | Discrete + continuous | Gini index / squared error | ✅ Yes | Classification + regression tree |
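The "gain ratio" row addresses the multi-valued-feature bias listed among ID3's cons: a column of unique IDs gets the maximum possible information gain while predicting nothing. A small sketch contrasting the two criteria on a toy set (the toy data and the helper name `gain_and_ratio` are illustrative only):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    ent = 0.0
    for count in Counter(labels).values():
        p = count / total
        ent -= p * math.log2(p)
    return ent

def gain_and_ratio(rows, idx):
    """Information gain and C4.5-style gain ratio for column idx (label in last column)."""
    base = entropy([r[-1] for r in rows])
    gain, split_info = base, 0.0
    for v in set(r[idx] for r in rows):
        subset = [r for r in rows if r[idx] == v]
        p = len(subset) / len(rows)
        gain -= p * entropy([r[-1] for r in subset])
        split_info -= p * math.log2(p)  # entropy of the split itself
    return gain, (gain / split_info if split_info else 0.0)

# Column 0 is a unique ID; column 1 is a genuinely useful (if imperfect) feature.
rows = [
    ["d1", "Low", "Yes"], ["d2", "Low", "Yes"], ["d3", "Low", "Yes"], ["d4", "Low", "Yes"],
    ["d5", "High", "Yes"], ["d6", "High", "No"], ["d7", "High", "No"], ["d8", "High", "No"],
]
print(gain_and_ratio(rows, 0))  # ID column: highest raw gain (≈0.954), ratio only ≈0.318
print(gain_and_ratio(rows, 1))  # real feature: lower gain (≈0.549), but higher ratio ≈0.549
```

Raw information gain would pick the ID column; dividing by the split information (the entropy of the partition itself) penalizes the eight-way split, so the gain ratio picks the real feature, which is exactly the fix C4.5 introduced.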
💡 Tip: for real projects, prefer C4.5 (Weka's J48) or CART (scikit-learn's DecisionTreeClassifier); ID3 is mainly a learning tool.
ID3 gave decision trees an information-theoretic soul. Its lesson: the best split is the one that leaves the outcome most certain.
Remember: in machine learning, reducing uncertainty is progress.
By now, you can compute entropy by hand, rank features by information gain, and build an ID3 tree from scratch in Python or Java.
Original-content notice: this article is published in the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
For infringement concerns, contact cloudcommunity@tencent.com for removal.