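Introduction
In a machine learning project, model evaluation and prediction are critical steps. They tell us how well a model performs, help us choose between candidate models and tune hyperparameters, and ultimately let us apply the model to real data. scikit-learn, one of the most popular machine learning libraries for Python, provides a rich set of tools and functions covering the whole evaluation and prediction workflow. This article walks from basic metrics to more advanced techniques for evaluating models and making predictions with scikit-learn.
1. Basic Evaluation Metrics
1.1 Classification Metrics
Accuracy is the most intuitive classification metric: the fraction of all samples that were predicted correctly. It can be misleading on imbalanced data, where always predicting the majority class already scores highly.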
- from sklearn.metrics import accuracy_score
- from sklearn.datasets import make_classification
- from sklearn.model_selection import train_test_split
- from sklearn.linear_model import LogisticRegression
- # Generate synthetic data
- X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- # Train the model
- model = LogisticRegression()
- model.fit(X_train, y_train)
- # Predict
- y_pred = model.predict(X_test)
- # Compute accuracy
- accuracy = accuracy_score(y_test, y_pred)
- print(f"Accuracy: {accuracy:.4f}")
Precision is the fraction of samples predicted as positive that really are positive: TP / (TP + FP). Recall is the fraction of actual positives that the model finds: TP / (TP + FN).
- from sklearn.metrics import precision_score, recall_score
- # Compute precision and recall
- precision = precision_score(y_test, y_pred)
- recall = recall_score(y_test, y_pred)
- print(f"Precision: {precision:.4f}")
- print(f"Recall: {recall:.4f}")
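The F1 score is the harmonic mean of precision and recall. It summarizes both in a single number and is high only when both are high.
- from sklearn.metrics import f1_score
- # Compute the F1 score
- f1 = f1_score(y_test, y_pred)
- print(f"F1 score: {f1:.4f}")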
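To make the three definitions concrete, here is a minimal sketch that computes precision, recall, and F1 by hand from confusion counts. The counts `tp`, `fp`, and `fn` are made-up illustrative numbers, not results from the model above.

```python
# Hypothetical confusion counts, chosen only for illustration
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller value, F1 here lands between the two scores but closer to the recall.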
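scikit-learn provides the classification_report function, which prints several classification metrics at once, broken down by class.
- from sklearn.metrics import classification_report
- # Build the classification report
- report = classification_report(y_test, y_pred)
- print("Classification report:")
- print(report)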
1.2 Regression Metrics
Mean absolute error (MAE) is the average of the absolute differences between predicted and true values.
- from sklearn.datasets import make_regression
- from sklearn.linear_model import LinearRegression
- from sklearn.metrics import mean_absolute_error
- # Generate synthetic data (note: this overwrites X, y and model from section 1.1)
- X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- # Train the model
- model = LinearRegression()
- model.fit(X_train, y_train)
- # Predict
- y_pred = model.predict(X_test)
- # Compute MAE
- mae = mean_absolute_error(y_test, y_pred)
- print(f"Mean absolute error (MAE): {mae:.4f}")
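Mean squared error (MSE) is the average of the squared differences between predicted and true values. Root mean squared error (RMSE) is the square root of MSE, which puts the error back in the units of the target.
- import numpy as np
- from sklearn.metrics import mean_squared_error
- # Compute MSE
- mse = mean_squared_error(y_test, y_pred)
- print(f"Mean squared error (MSE): {mse:.4f}")
- # Compute RMSE as the square root of MSE. The squared=False argument that older
- # code passed to mean_squared_error is no longer accepted by recent scikit-learn releases.
- rmse = np.sqrt(mse)
- print(f"Root mean squared error (RMSE): {rmse:.4f}")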
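The practical difference between MAE and RMSE is how they react to a few large errors. The sketch below uses two hand-written residual lists (illustrative values only) with the same total absolute error.

```python
import math

def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

even_errors = [1.0, 1.0, 1.0, 1.0]   # four equally small errors
one_outlier = [0.0, 0.0, 0.0, 4.0]   # same total error, concentrated in one sample

print(mae(even_errors), rmse(even_errors))   # 1.0 1.0
print(mae(one_outlier), rmse(one_outlier))   # 1.0 2.0
```

MAE cannot tell the two cases apart, while RMSE doubles when the error is concentrated in a single sample, so RMSE is the stricter choice when large misses are especially costly.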
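The R² score (coefficient of determination) measures how much of the variance in the target the model explains. The best possible value is 1, and a model that always predicts the mean of the true values scores 0. A model that does worse than that gets a negative score, so R² is not bounded below by 0.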
- from sklearn.metrics import r2_score
- # Compute the R² score
- r2 = r2_score(y_test, y_pred)
- print(f"Coefficient of determination (R²): {r2:.4f}")
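As a small worked example of the definition R² = 1 - SS_res / SS_tot, the sketch below computes R² by hand for three sets of hypothetical predictions: a perfect fit, a constant prediction at the mean, and a constant prediction that is worse than the mean.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed from plain Python lists."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]
print(r2(y_true, [1.0, 2.0, 3.0]))   # 1.0  (perfect predictions)
print(r2(y_true, [2.0, 2.0, 2.0]))   # 0.0  (always predicts the mean)
print(r2(y_true, [3.0, 3.0, 3.0]))   # -1.5 (worse than predicting the mean)
```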
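2. Advanced Metrics and Techniques
2.1 Confusion Matrix
A confusion matrix breaks classification results down into true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
- import matplotlib.pyplot as plt
- from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
- # Section 1.2 overwrote X, y and model with regression objects,
- # so recreate the classification data and model from section 1.1
- X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- model = LogisticRegression().fit(X_train, y_train)
- y_pred = model.predict(X_test)
- # Compute the confusion matrix
- cm = confusion_matrix(y_test, y_pred)
- # Plot the confusion matrix
- disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
- disp.plot(cmap=plt.cm.Blues)
- plt.title('Confusion Matrix')
- plt.show()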
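For a binary problem the four counts can also be read directly from the matrix. A minimal sketch, using tiny hand-made labels (illustrative only):

```python
from sklearn.metrics import confusion_matrix

# Tiny hand-made labels, for illustration only
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Rows are true classes and columns are predicted classes,
# so for labels {0, 1} the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
```

Note the order: scikit-learn puts the negative class first, so the top-left cell is TN rather than TP.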
2.2 ROC Curve and AUC
The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold varies. AUC is the area under that curve and summarizes performance across all thresholds.
- from sklearn.metrics import roc_curve, auc, RocCurveDisplay
- # Predicted probability of the positive class
- y_prob = model.predict_proba(X_test)[:, 1]
- # Compute the ROC curve and its area
- fpr, tpr, thresholds = roc_curve(y_test, y_prob)
- roc_auc = auc(fpr, tpr)
- # Plot the ROC curve (from_predictions draws the curve and shows the AUC in the legend)
- RocCurveDisplay.from_predictions(y_test, y_prob, name='Logistic Regression')
- plt.title('ROC Curve')
- plt.show()
- print(f"AUC: {roc_auc:.4f}")
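One way to build intuition for AUC: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below checks this on four hand-made scores (illustrative values) against `roc_auc_score`.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
pairwise_auc = wins / (len(pos) * len(neg))

print(pairwise_auc, roc_auc_score(y_true, y_score))
```

Three of the four positive/negative pairs are ordered correctly, so both numbers come out as 0.75.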
2.3 Precision-Recall Curve
The precision-recall curve shows the trade-off between precision and recall as the threshold varies. It is especially informative on imbalanced datasets, where the positive class is rare.
- from sklearn.metrics import precision_recall_curve, average_precision_score, PrecisionRecallDisplay
- # Compute the precision-recall curve
- precision, recall, _ = precision_recall_curve(y_test, y_prob)
- average_precision = average_precision_score(y_test, y_prob)
- # Plot the precision-recall curve
- display = PrecisionRecallDisplay(precision=precision, recall=recall, average_precision=average_precision)
- display.plot()
- plt.title('Precision-Recall Curve')
- plt.show()
- print(f"Average precision: {average_precision:.4f}")
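The curve can also be used to pick an operating threshold. The following is a minimal sketch, assuming F1 is the criterion you care about (other criteria, such as a minimum recall, work the same way); the labels and scores are hand-made illustrative values.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# The last precision/recall pair has no matching threshold, so drop it
p, r = precision[:-1], recall[:-1]
denom = p + r
f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

best_idx = int(np.argmax(f1))
best_threshold = float(thresholds[best_idx])
print(f"best threshold by F1: {best_threshold}, F1 there: {f1[best_idx]:.3f}")
```

In real use the threshold should be chosen on validation data, not on the test set that is later used to report performance.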
2.4 Learning Curve
A learning curve shows how model performance changes with training set size, which helps diagnose overfitting and underfitting.
- import numpy as np
- from sklearn.model_selection import learning_curve
- def plot_learning_curve(estimator, title, X, y, cv=None, n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
-     plt.figure()
-     plt.title(title)
-     plt.xlabel("Training examples")
-     plt.ylabel("Score")
-
-     train_sizes, train_scores, test_scores = learning_curve(
-         estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
-
-     # Mean and standard deviation across CV folds for each training size
-     train_scores_mean = np.mean(train_scores, axis=1)
-     train_scores_std = np.std(train_scores, axis=1)
-     test_scores_mean = np.mean(test_scores, axis=1)
-     test_scores_std = np.std(test_scores, axis=1)
-
-     plt.grid()
-
-     plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
-                      train_scores_mean + train_scores_std, alpha=0.1,
-                      color="r")
-     plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
-                      test_scores_mean + test_scores_std, alpha=0.1, color="g")
-     plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
-              label="Training score")
-     plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
-              label="Cross-validation score")
-
-     plt.legend(loc="best")
-     return plt
- # Plot the learning curve
- plot_learning_curve(model, "Learning Curve (Logistic Regression)", X, y, cv=5)
- plt.show()
2.5 Validation Curve
A validation curve shows how model performance changes with one hyperparameter, which helps choose a good value for it.
- from sklearn.model_selection import validation_curve
- def plot_validation_curve(estimator, title, X, y, param_name, param_range, cv=None, scoring=None, n_jobs=None):
-     train_scores, test_scores = validation_curve(
-         estimator, X, y, param_name=param_name, param_range=param_range,
-         cv=cv, scoring=scoring, n_jobs=n_jobs)
-
-     # Mean and standard deviation across CV folds for each parameter value
-     train_scores_mean = np.mean(train_scores, axis=1)
-     train_scores_std = np.std(train_scores, axis=1)
-     test_scores_mean = np.mean(test_scores, axis=1)
-     test_scores_std = np.std(test_scores, axis=1)
-
-     plt.title(title)
-     plt.xlabel(param_name)
-     plt.ylabel("Score")
-     plt.ylim(0.0, 1.1)
-     lw = 2
-     plt.plot(param_range, train_scores_mean, label="Training score", color="darkorange", lw=lw)
-     plt.fill_between(param_range, train_scores_mean - train_scores_std,
-                      train_scores_mean + train_scores_std, alpha=0.2,
-                      color="darkorange", lw=lw)
-     plt.plot(param_range, test_scores_mean, label="Cross-validation score", color="navy", lw=lw)
-     plt.fill_between(param_range, test_scores_mean - test_scores_std,
-                      test_scores_mean + test_scores_std, alpha=0.2,
-                      color="navy", lw=lw)
-     plt.legend(loc="best")
-     return plt
- # Plot the validation curve for the regularization parameter C
- param_range = np.logspace(-3, 3, 7)
- plot_validation_curve(model, "Validation Curve (Logistic Regression)", X, y,
-                       param_name="C", param_range=param_range, cv=5)
- plt.xscale('log')
- plt.show()
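Besides plotting, the scores returned by `validation_curve` can be used directly to pick a value. A minimal sketch, assuming a small synthetic dataset (`X_demo`, `y_demo` are illustrative names) and "highest mean cross-validation score" as the selection rule:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

X_demo, y_demo = make_classification(n_samples=300, random_state=0)
param_range = np.logspace(-3, 3, 7)

train_scores, test_scores = validation_curve(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    param_name="C", param_range=param_range, cv=5)

# Rows correspond to parameter values, columns to CV folds
mean_test = test_scores.mean(axis=1)
best_C = param_range[int(np.argmax(mean_test))]
print(f"best C by mean CV score: {best_C}")
```

A large gap between training and cross-validation scores at high C would suggest overfitting; low scores on both at small C would suggest underfitting.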
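3. Cross-Validation
3.1 K-Fold Cross-Validation
K-fold cross-validation splits the dataset into K subsets (folds). Each fold serves as the validation set once while the other K-1 folds are used for training, so the model is trained and scored K times.
- from sklearn.model_selection import cross_val_score
- # Run 5-fold cross-validation (for a classifier, an integer cv uses stratified folds)
- cv_scores = cross_val_score(model, X, y, cv=5)
- print(f"Cross-validation scores: {cv_scores}")
- print(f"Mean cross-validation score: {cv_scores.mean():.4f}")
- print(f"Standard deviation of cross-validation scores: {cv_scores.std():.4f}")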
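To see what the splitting looks like, the sketch below prints the index sets that a plain `KFold` produces on ten toy samples (`X_demo` is an illustrative array, not the dataset above).

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)   # 10 toy samples

kf = KFold(n_splits=5)
seen_in_test = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo)):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
    seen_in_test.extend(test_idx.tolist())
```

Every sample lands in a validation fold exactly once, which is why the K scores together cover the whole dataset.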
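3.2 Stratified K-Fold Cross-Validation
Stratified K-fold cross-validation keeps the class proportions in each fold as close as possible to those of the full dataset, which matters most for imbalanced data.
- from sklearn.model_selection import StratifiedKFold
- # Create the stratified K-fold splitter
- stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
- # Run stratified K-fold cross-validation
- stratified_cv_scores = cross_val_score(model, X, y, cv=stratified_kfold)
- print(f"Stratified cross-validation scores: {stratified_cv_scores}")
- print(f"Mean stratified cross-validation score: {stratified_cv_scores.mean():.4f}")
- print(f"Standard deviation of stratified cross-validation scores: {stratified_cv_scores.std():.4f}")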
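The effect of stratification is easy to check on a toy label vector. In this sketch `y_demo` is a hand-made array with 10% positives; the features are placeholders because only the labels influence the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_demo = np.array([0] * 90 + [1] * 10)   # 10% positives
X_demo = np.zeros((len(y_demo), 1))      # features are irrelevant for splitting

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positive_rates = [float(y_demo[test_idx].mean())
                  for _, test_idx in skf.split(X_demo, y_demo)]
print(positive_rates)
```

Each validation fold keeps the 10% positive rate, whereas a plain unshuffled K-fold on this ordered array would put all positives into the last fold.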
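3.3 Leave-One-Out Cross-Validation
Leave-one-out cross-validation is the special case of K-fold in which K equals the number of samples: each round holds out a single sample for validation.
- from sklearn.model_selection import LeaveOneOut
- # Create the leave-one-out splitter
- loo = LeaveOneOut()
- # Run leave-one-out cross-validation (one fit per sample, so only practical for small datasets)
- # loo_scores = cross_val_score(model, X, y, cv=loo)
- # print(f"Mean leave-one-out score: {loo_scores.mean():.4f}")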
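3.4 Time Series Split
Time series cross-validation is designed for time-ordered data: the validation set always comes after the training set, so the model is never trained on the future.
- from sklearn.model_selection import TimeSeriesSplit
- # Create the time series splitter
- tscv = TimeSeriesSplit(n_splits=5)
- # Run time series cross-validation. The synthetic samples here have no real time order;
- # with real data the rows must be sorted by time first.
- ts_scores = cross_val_score(model, X, y, cv=tscv)
- print(f"Time series cross-validation scores: {ts_scores}")
- print(f"Mean time series cross-validation score: {ts_scores.mean():.4f}")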
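The expanding-window behaviour is clearest when the indices are printed. A minimal sketch on eight toy samples that are assumed to be sorted by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(8).reshape(-1, 1)   # 8 toy samples, assumed ordered by time

splits = list(TimeSeriesSplit(n_splits=3).split(X_demo))
for train_idx, test_idx in splits:
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows with each split and every validation index is later than every training index in the same split.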
4. Hyperparameter Tuning
4.1 Grid Search
Grid search evaluates every combination in a given hyperparameter grid with cross-validation and keeps the best one.
- from sklearn.model_selection import GridSearchCV
- from sklearn.svm import SVC
- # Define the parameter grid (gamma has no effect when kernel='linear')
- param_grid = {
-     'C': [0.1, 1, 10, 100],
-     'gamma': [1, 0.1, 0.01, 0.001],
-     'kernel': ['rbf', 'linear']
- }
- # Create the SVC model
- svc = SVC(probability=True)
- # Create the grid search object
- grid_search = GridSearchCV(estimator=svc, param_grid=param_grid, cv=5, verbose=2, n_jobs=-1)
- # Run the grid search
- grid_search.fit(X_train, y_train)
- # Report the best parameters and score
- print(f"Best parameters: {grid_search.best_params_}")
- print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
- # Predict with the best model
- best_model = grid_search.best_estimator_
- y_pred = best_model.predict(X_test)
- print(f"Test accuracy: {accuracy_score(y_test, y_pred):.4f}")
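Grid search cost grows multiplicatively with each added parameter, so it helps to count the work before launching it. A small sketch using `ParameterGrid` on the same grid as above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'linear'],
}

n_candidates = len(ParameterGrid(param_grid))   # 4 * 4 * 2 combinations
n_folds = 5
print(f"{n_candidates} candidates x {n_folds} folds = {n_candidates * n_folds} fits")
```

With the default `refit=True`, `GridSearchCV` performs one more fit at the end to retrain the best candidate on the full training set.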
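4.2 Random Search
Random search samples a fixed number of parameter combinations from the search space. It is usually more efficient than grid search because the budget (n_iter) is set independently of how many parameters and values are explored.
- from sklearn.model_selection import RandomizedSearchCV
- from scipy.stats import uniform
- # Define the parameter distributions.
- # Note: scipy's uniform(loc, scale) samples from [loc, loc + scale].
- param_dist = {
-     'C': uniform(0.1, 100),
-     'gamma': uniform(0.001, 1),
-     'kernel': ['rbf', 'linear']
- }
- # Create the random search object
- random_search = RandomizedSearchCV(
-     estimator=svc,
-     param_distributions=param_dist,
-     n_iter=20,
-     cv=5,
-     verbose=2,
-     n_jobs=-1,
-     random_state=42
- )
- # Run the random search
- random_search.fit(X_train, y_train)
- # Report the best parameters and score
- print(f"Best parameters: {random_search.best_params_}")
- print(f"Best cross-validation score: {random_search.best_score_:.4f}")
- # Predict with the best model
- best_model = random_search.best_estimator_
- y_pred = best_model.predict(X_test)
- print(f"Test accuracy: {accuracy_score(y_test, y_pred):.4f}")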
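For scale-type parameters such as C and gamma, a log-uniform distribution is a common alternative to `uniform`, because it spends as many samples between 0.001 and 0.01 as between 10 and 100. The sketch below is an illustration of the two distributions, not part of the search above; the bounds are arbitrary example values.

```python
from scipy.stats import loguniform, uniform

# uniform(loc, scale) covers [loc, loc + scale], not [low, high]
low, high = uniform(0.1, 100).support()
print(low, high)

# loguniform(a, b) covers [a, b] and spreads samples evenly across orders of magnitude
c_samples = loguniform(1e-3, 1e2).rvs(size=1000, random_state=0)
share_below_one = (c_samples < 1.0).mean()
print(f"min={c_samples.min():.4g}, max={c_samples.max():.4g}, "
      f"share below 1: {share_below_one:.2f}")
```

A `loguniform` object can be passed in `param_distributions` in the same way as `uniform`.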
4.3 Bayesian Optimization
Bayesian optimization is a more advanced tuning method: it builds a probabilistic model of the score as a function of the hyperparameters and uses that model to choose the next combination to evaluate.
- # Requires scikit-optimize: pip install scikit-optimize
- # (check that the installed release supports your scikit-learn version)
- from skopt import BayesSearchCV
- from skopt.space import Real, Categorical, Integer
- # Define the search space
- search_spaces = {
-     'C': Real(0.1, 100, prior='log-uniform'),
-     'gamma': Real(0.001, 1, prior='log-uniform'),
-     'kernel': Categorical(['rbf', 'linear'])
- }
- # Create the Bayesian search object
- bayes_search = BayesSearchCV(
-     estimator=svc,
-     search_spaces=search_spaces,
-     n_iter=20,
-     cv=5,
-     verbose=2,
-     n_jobs=-1,
-     random_state=42
- )
- # Run the Bayesian optimization
- bayes_search.fit(X_train, y_train)
- # Report the best parameters and score
- print(f"Best parameters: {bayes_search.best_params_}")
- print(f"Best cross-validation score: {bayes_search.best_score_:.4f}")
- # Predict with the best model
- best_model = bayes_search.best_estimator_
- y_pred = best_model.predict(X_test)
- print(f"Test accuracy: {accuracy_score(y_test, y_pred):.4f}")
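5. Prediction Techniques
5.1 Predicting with Classification Models
- from sklearn.ensemble import RandomForestClassifier
- # Train a random forest classifier
- rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
- rf_classifier.fit(X_train, y_train)
- # Basic prediction
- y_pred = rf_classifier.predict(X_test)
- print(f"First 5 predictions: {y_pred[:5]}")
- # Predict class probabilities
- y_prob = rf_classifier.predict_proba(X_test)
- print(f"Class probabilities for the first 5 samples:\n{y_prob[:5]}")
- # Probability of the positive class
- y_prob_positive = y_prob[:, 1]
- print(f"Positive-class probability for the first 5 samples: {y_prob_positive[:5]}")
- # Decision function values (only for models that provide decision_function, such as SVM;
- # random forests do not, so this branch is skipped here)
- if hasattr(rf_classifier, 'decision_function'):
-     y_decision = rf_classifier.decision_function(X_test)
-     print(f"Decision function values for the first 5 samples: {y_decision[:5]}")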
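The relationship between `predict` and `predict_proba` is worth spelling out: `predict` returns the most probable class, which in the binary case amounts to a 0.5 threshold. The sketch below checks this on a small synthetic dataset and then applies a custom threshold; the names `X_demo`, `clf` and the value 0.3 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)

# predict() picks the most probable class, i.e. a 0.5 threshold in the binary case
default_pred = clf.classes_[np.argmax(proba, axis=1)]
same_as_predict = np.array_equal(default_pred, clf.predict(X_demo))

# A lower threshold (0.3 is an arbitrary illustrative value) flags more samples as positive
threshold = 0.3
custom_pred = (proba[:, 1] >= threshold).astype(int)
print(same_as_predict, int(default_pred.sum()), int(custom_pred.sum()))
```

Lowering the threshold trades precision for recall, which can be the right choice when missing a positive is more costly than a false alarm.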
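5.2 Predicting with Regression Models
- from sklearn.ensemble import RandomForestRegressor
- # Generate regression data under separate names, so the classification
- # split (X_train, X_test, ...) stays available for the following sections
- X_reg, y_reg = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
- X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
- # Train a random forest regressor
- rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
- rf_regressor.fit(X_train_reg, y_train_reg)
- # Predict
- y_pred_reg = rf_regressor.predict(X_test_reg)
- print(f"First 5 predictions: {y_pred_reg[:5]}")
- print(f"First 5 true values: {y_test_reg[:5]}")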
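5.3 Predicting with Clustering Models
- from sklearn.cluster import KMeans
- from sklearn.datasets import make_blobs
- # Generate clustering data
- X_blob, y_blob = make_blobs(n_samples=300, centers=4, random_state=42)
- # Fit a KMeans model (n_init is set explicitly to avoid depending on version-specific defaults)
- kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
- kmeans.fit(X_blob)
- # Predict cluster labels
- y_cluster = kmeans.predict(X_blob)
- print(f"Cluster labels of the first 5 samples: {y_cluster[:5]}")
- # Get the cluster centers
- centers = kmeans.cluster_centers_
- print(f"Cluster center coordinates:\n{centers}")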
6. Model Persistence
6.1 Saving and Loading Models with pickle
- import pickle
- # Save the model to a file
- with open('model.pkl', 'wb') as file:
-     pickle.dump(rf_classifier, file)
- # Load the model from the file (only unpickle files from sources you trust)
- with open('model.pkl', 'rb') as file:
-     loaded_model = pickle.load(file)
- # Predict with the loaded model
- y_pred_loaded = loaded_model.predict(X_test)
- print(f"First 5 predictions from the loaded model: {y_pred_loaded[:5]}")
6.2 Saving and Loading Models with joblib
- from joblib import dump, load
- # Save the model to a file
- dump(rf_classifier, 'model.joblib')
- # Load the model from the file
- loaded_model = load('model.joblib')
- # Predict with the loaded model
- y_pred_loaded = loaded_model.predict(X_test)
- print(f"First 5 predictions from the loaded model: {y_pred_loaded[:5]}")
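When a model depends on preprocessing, it is often convenient to persist the preprocessing and the model together. The following is a sketch under that assumption: it saves a whole `Pipeline` with joblib into a temporary directory and checks that the restored object gives the same predictions. The dataset and the file name `pipeline.joblib` are illustrative.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

# Scaler and classifier are saved as one object, so the fitted scaling statistics are not lost
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pipeline.joblib')   # hypothetical file name
    dump(pipe, path)
    restored = load(path)

same_predictions = np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Saved models are generally expected to be loaded with the same scikit-learn version that created them, so it is worth recording the version alongside the file.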
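7. Case Study: Evaluating and Predicting on a Real Dataset
7.1 Data Preparation and Exploration
- import pandas as pd
- from sklearn.datasets import load_breast_cancer
- # Load the breast cancer dataset
- cancer = load_breast_cancer()
- X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
- y = pd.Series(cancer.target)
- # Basic information about the data.
- # In this dataset 0 = malignant and 1 = benign, so with default settings
- # the "positive" class for precision and recall is benign.
- print(f"Data shape: {X.shape}")
- print(f"Feature names: {X.columns.tolist()}")
- print(f"Target classes: {cancer.target_names}")
- print(f"Class distribution:\n{y.value_counts()}")
- # Summary statistics
- print("\nSummary statistics:")
- print(X.describe())
- # Split into training and test sets, preserving class proportions
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)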
7.2 Data Preprocessing
- from sklearn.preprocessing import StandardScaler
- from sklearn.pipeline import Pipeline
- # Create the preprocessing pipeline
- preprocessor = Pipeline([
-     ('scaler', StandardScaler())
- ])
- # Fit the preprocessing on the training data and transform it
- X_train_processed = preprocessor.fit_transform(X_train)
- # Apply the already fitted preprocessing to the test data
- X_test_processed = preprocessor.transform(X_test)
7.3 Model Training and Evaluation
- from sklearn.ensemble import RandomForestClassifier
- from sklearn.svm import SVC
- from sklearn.linear_model import LogisticRegression
- # Define several models
- models = {
-     'Logistic Regression': LogisticRegression(max_iter=10000, random_state=42),
-     'SVM': SVC(probability=True, random_state=42),
-     'Random Forest': RandomForestClassifier(random_state=42)
- }
- # Train and evaluate each model
- results = {}
- for name, model in models.items():
-     # Train the model
-     model.fit(X_train_processed, y_train)
-
-     # Predict
-     y_pred = model.predict(X_test_processed)
-     y_prob = model.predict_proba(X_test_processed)[:, 1] if hasattr(model, 'predict_proba') else None
-
-     # Evaluate the model
-     accuracy = accuracy_score(y_test, y_pred)
-     precision = precision_score(y_test, y_pred)
-     recall = recall_score(y_test, y_pred)
-     f1 = f1_score(y_test, y_pred)
-
-     # Compute AUC (if the model supports probability predictions)
-     if y_prob is not None:
-         fpr, tpr, _ = roc_curve(y_test, y_prob)
-         roc_auc = auc(fpr, tpr)
-     else:
-         roc_auc = None
-
-     # Store the results
-     results[name] = {
-         'accuracy': accuracy,
-         'precision': precision,
-         'recall': recall,
-         'f1': f1,
-         'roc_auc': roc_auc
-     }
-
-     # Print the results
-     print(f"\n{name} evaluation results:")
-     print(f"Accuracy: {accuracy:.4f}")
-     print(f"Precision: {precision:.4f}")
-     print(f"Recall: {recall:.4f}")
-     print(f"F1 score: {f1:.4f}")
-     if roc_auc is not None:
-         print(f"AUC: {roc_auc:.4f}")
-
-     # Print the classification report
-     print("\nClassification report:")
-     print(classification_report(y_test, y_pred))
7.4 Model Comparison and Visualization
- import matplotlib.pyplot as plt
- import numpy as np
- # Prepare the data to compare
- metrics = ['accuracy', 'precision', 'recall', 'f1']
- model_names = list(results.keys())
- # Draw a grouped bar chart
- x = np.arange(len(metrics))
- width = 0.25
- fig, ax = plt.subplots(figsize=(12, 6))
- for i, model_name in enumerate(model_names):
-     values = [results[model_name][metric] for metric in metrics]
-     ax.bar(x + i * width, values, width, label=model_name)
- # Add labels and title
- ax.set_xlabel('Metric')
- ax.set_ylabel('Score')
- ax.set_title('Model Performance Comparison')
- ax.set_xticks(x + width)
- ax.set_xticklabels(metrics)
- ax.legend()
- plt.tight_layout()
- plt.show()
- # Plot the ROC curves
- plt.figure(figsize=(10, 8))
- for name, model in models.items():
-     if hasattr(model, 'predict_proba'):
-         y_prob = model.predict_proba(X_test_processed)[:, 1]
-         fpr, tpr, _ = roc_curve(y_test, y_prob)
-         roc_auc = auc(fpr, tpr)
-         plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')
- plt.plot([0, 1], [0, 1], 'k--')  # diagonal = random guessing
- plt.xlim([0.0, 1.0])
- plt.ylim([0.0, 1.05])
- plt.xlabel('False positive rate')
- plt.ylabel('True positive rate')
- plt.title('ROC Curve Comparison')
- plt.legend(loc="lower right")
- plt.show()
7.5 Hyperparameter Tuning
- # Pick the best model for hyperparameter tuning.
- # Note: selecting by test-set F1 keeps the example short; in practice, select on a
- # validation set or by cross-validation (see section 8.4).
- best_model_name = max(results.keys(), key=lambda k: results[k]['f1'])
- print(f"Tuning hyperparameters for {best_model_name}")
- if best_model_name == 'Logistic Regression':
-     model = LogisticRegression(max_iter=10000, random_state=42)
-     param_grid = {
-         'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
-         'penalty': ['l1', 'l2'],
-         'solver': ['liblinear', 'saga']
-     }
- elif best_model_name == 'SVM':
-     model = SVC(probability=True, random_state=42)
-     param_grid = {
-         'C': [0.1, 1, 10, 100],
-         'gamma': [1, 0.1, 0.01, 0.001],
-         'kernel': ['rbf', 'linear']
-     }
- else:  # Random Forest
-     model = RandomForestClassifier(random_state=42)
-     param_grid = {
-         'n_estimators': [50, 100, 200],
-         'max_depth': [None, 10, 20, 30],
-         'min_samples_split': [2, 5, 10],
-         'min_samples_leaf': [1, 2, 4]
-     }
- # Create the grid search object
- grid_search = GridSearchCV(
-     estimator=model,
-     param_grid=param_grid,
-     cv=5,
-     scoring='f1',
-     verbose=1,
-     n_jobs=-1
- )
- # Run the grid search
- grid_search.fit(X_train_processed, y_train)
- # Report the best parameters and score
- print(f"\nBest parameters: {grid_search.best_params_}")
- print(f"Best cross-validation F1 score: {grid_search.best_score_:.4f}")
- # Predict with the best model
- best_model = grid_search.best_estimator_
- y_pred = best_model.predict(X_test_processed)
- y_prob = best_model.predict_proba(X_test_processed)[:, 1] if hasattr(best_model, 'predict_proba') else None
- # Evaluate the best model
- accuracy = accuracy_score(y_test, y_pred)
- precision = precision_score(y_test, y_pred)
- recall = recall_score(y_test, y_pred)
- f1 = f1_score(y_test, y_pred)
- if y_prob is not None:
-     fpr, tpr, _ = roc_curve(y_test, y_prob)
-     roc_auc = auc(fpr, tpr)
- else:
-     roc_auc = None
- print("\nTuned model evaluation results:")
- print(f"Accuracy: {accuracy:.4f}")
- print(f"Precision: {precision:.4f}")
- print(f"Recall: {recall:.4f}")
- print(f"F1 score: {f1:.4f}")
- if roc_auc is not None:
-     print(f"AUC: {roc_auc:.4f}")
- # Print the classification report
- print("\nClassification report:")
- print(classification_report(y_test, y_pred))
7.6 Feature Importance Analysis
- # Analyze feature importance if the model exposes it
- if hasattr(best_model, 'feature_importances_'):
-     # Get the feature importances
-     importances = best_model.feature_importances_
-
-     # Build a feature importance DataFrame
-     feature_importance = pd.DataFrame({
-         'feature': X.columns,
-         'importance': importances
-     }).sort_values('importance', ascending=False)
-
-     # Print the 10 most important features
-     print("Top 10 most important features:")
-     print(feature_importance.head(10))
-
-     # Plot the feature importances
-     plt.figure(figsize=(12, 8))
-     plt.barh(feature_importance['feature'][:10], feature_importance['importance'][:10])
-     plt.xlabel('Importance')
-     plt.ylabel('Feature')
-     plt.title('Feature Importance')
-     plt.gca().invert_yaxis()
-     plt.show()
- elif hasattr(best_model, 'coef_'):
-     # For linear models, use the absolute coefficient values as importance
-     # (comparable across features here because the inputs were standardized)
-     if len(best_model.coef_.shape) == 1:
-         coefficients = best_model.coef_
-     else:
-         coefficients = best_model.coef_[0]
-
-     # Build a feature importance DataFrame
-     feature_importance = pd.DataFrame({
-         'feature': X.columns,
-         'importance': np.abs(coefficients)
-     }).sort_values('importance', ascending=False)
-
-     # Print the 10 most important features
-     print("Top 10 most important features:")
-     print(feature_importance.head(10))
-
-     # Plot the feature importances
-     plt.figure(figsize=(12, 8))
-     plt.barh(feature_importance['feature'][:10], feature_importance['importance'][:10])
-     plt.xlabel('Absolute coefficient')
-     plt.ylabel('Feature')
-     plt.title('Feature Importance')
-     plt.gca().invert_yaxis()
-     plt.show()
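8. Best Practices and Common Pitfalls
8.1 Data Leakage
Data leakage occurs when information from the test set is used during training, which makes the evaluation overly optimistic.
- # Wrong: standardize before splitting the dataset
- X_wrong = StandardScaler().fit_transform(X)  # wrong: the scaler is fitted on the full dataset, test rows included
- X_train_wrong, X_test_wrong, y_train, y_test = train_test_split(X_wrong, y, test_size=0.2, random_state=42)
- # Right: split first, fit the scaler on the training set only, then apply it to both sets
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- scaler = StandardScaler()
- X_train_correct = scaler.fit_transform(X_train)  # right: fitted on the training set only
- X_test_correct = scaler.transform(X_test)  # right: the same fitted transform is applied to the test set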
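8.2 Handling Class Imbalance
Class imbalance biases a model toward the majority class and hurts its performance on the minority class.
- # Requires imbalanced-learn: pip install imbalanced-learn
- from imblearn.over_sampling import SMOTE
- from imblearn.under_sampling import RandomUnderSampler
- from imblearn.pipeline import Pipeline as ImbPipeline
- # Create an imbalanced dataset
- X_imb, y_imb = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)
- # Inspect the class distribution
- print(f"Original class distribution: {pd.Series(y_imb).value_counts().to_dict()}")
- # Option 1: use the class_weight parameter
- model_balanced = LogisticRegression(class_weight='balanced', random_state=42)
- model_balanced.fit(X_imb, y_imb)
- # Option 2: combine oversampling and undersampling.
- # With default settings SMOTE already balances the classes and the undersampler has nothing
- # left to do, so partial ratios are set here (0.5 and 0.8 are illustrative values).
- resampling = ImbPipeline([
-     ('oversample', SMOTE(sampling_strategy=0.5, random_state=42)),
-     ('undersample', RandomUnderSampler(sampling_strategy=0.8, random_state=42)),
-     ('classifier', LogisticRegression(random_state=42))
- ])
- resampling.fit(X_imb, y_imb)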
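To see what `class_weight='balanced'` does, the weights can be computed explicitly. In this sketch `y_demo` is a hand-made label vector with a 90/10 split; `compute_class_weight` applies the rule n_samples / (n_classes * count of each class).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_demo = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_demo)

# 'balanced' weights follow n_samples / (n_classes * count_of_each_class)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))
print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

Errors on the minority class are weighted nine times more heavily than errors on the majority class, which offsets the 9:1 imbalance without changing the data.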
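8.3 Using Cross-Validation Correctly
Preprocessing must be fitted inside each cross-validation fold, on that fold's training portion only. Fitting a scaler on the full dataset before cross-validating leaks information from the validation folds into training. With a simple scaler the numerical difference is often small, but the same mistake with feature selection or target-based encoding can inflate the scores considerably.
- # Wrong: fit the scaler on the full dataset, then cross-validate
- X_scaled = StandardScaler().fit_transform(X)
- cv_scores_wrong = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)
- # Right: put preprocessing and model in a Pipeline, so the scaler is refitted on each training fold
- pipeline = Pipeline([
-     ('scaler', StandardScaler()),
-     ('classifier', LogisticRegression())
- ])
- cv_scores_correct = cross_val_score(pipeline, X, y, cv=5)
- print(f"Cross-validation scores with leakage: {cv_scores_wrong}")
- print(f"Cross-validation scores without leakage: {cv_scores_correct}")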
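8.4 Model Selection and Evaluation
During model selection, keep a separate test set that plays no part in choosing the model and is used only for the final evaluation.
- # Split the data into training, validation and test sets.
- # 20% is held out for testing; 25% of the remaining 80% is 20% of the total,
- # which gives a 60/20/20 split.
- X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
- print(f"Training set size: {X_train.shape[0]}")
- print(f"Validation set size: {X_val.shape[0]}")
- print(f"Test set size: {X_test.shape[0]}")
- # Train on the training set and use the validation set to choose between models
- pipeline = Pipeline([
-     ('scaler', StandardScaler()),
-     ('classifier', LogisticRegression())
- ])
- pipeline.fit(X_train, y_train)
- val_score = pipeline.score(X_val, y_val)
- # Evaluate the final model on the test set
- test_score = pipeline.score(X_test, y_test)
- print(f"Validation score: {val_score:.4f}")
- print(f"Test score: {test_score:.4f}")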
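9. Conclusion
This article covered the core skills for model evaluation and prediction with scikit-learn, from basic metrics to more advanced techniques: evaluation metrics, cross-validation, hyperparameter tuning strategies, prediction, and model persistence. The case study showed how to apply these techniques to a real dataset, and the final section discussed common pitfalls and best practices.
These skills will help you assess the performance of machine learning models, choose the model best suited to a given problem, and apply it in practice. Evaluation and prediction are key to the success of a machine learning project, and they call for careful thought about the characteristics of the data, the business requirements, and the choice of metric.
As you gain experience, you will apply these techniques more fluently and can explore more advanced methods for harder problems. I hope this article is a useful guide on your machine learning journey.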