Scikit-learn's model_selection.RandomizedSearchCV Explained: Randomized Hyperparameter Search

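sklearn.model_selection.RandomizedSearchCV is the class in Scikit-learn (sklearn for short) that implements randomized search for automated hyperparameter tuning. It samples a fixed number of candidate settings at random from a given hyperparameter space, evaluates each one with cross-validation, and returns the setting that scores best on the chosen metric (such as accuracy or AUC).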
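To make the sampling step concrete, here is a minimal sketch that uses sklearn.model_selection.ParameterSampler to draw candidates from a small space. The space and the variable names (space, candidates) are made up for illustration, and the sketch only shows the sampling, not the cross-validated scoring that RandomizedSearchCV adds on top.

```python
from sklearn.model_selection import ParameterSampler

# Hypothetical search space: 5 x 5 = 25 possible combinations
space = {"max_depth": [1, 2, 3, 4, 5],
         "n_estimators": [10, 20, 30, 40, 50]}

# Draw 4 of the 25 combinations at random; random_state makes the draw repeatable
candidates = list(ParameterSampler(space, n_iter=4, random_state=0))

for params in candidates:
    print(params)
```

Because every value here is a list, the combinations are drawn without replacement, so the four candidates are distinct.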

The following shows how sklearn.model_selection.RandomizedSearchCV is used:

from sklearn.model_selection import RandomizedSearchCV

# Define the parameter space to sample from
param_distribution = {"max_depth": [1, 2, 3, 4, 5],
                      "min_samples_split": [2, 3, 4, 5, 6],  # an integer value must be at least 2
                      "n_estimators": [10, 20, 30, 40, 50]}

# Instantiate RandomizedSearchCV (model is an estimator created beforehand, e.g. RandomForestClassifier())
random_search = RandomizedSearchCV(estimator=model, 
                                   param_distributions=param_distribution, 
                                   n_iter=10, 
                                   scoring='roc_auc', 
                                   n_jobs=-1, 
                                   cv=5)

# Fit the search on the training data (X_train and y_train are assumed to exist)
random_search.fit(X_train, y_train)

# Print the best parameters found
best_params = random_search.best_params_
print(best_params)

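estimator: the estimator to tune, passed as an instance rather than as a name. Any object that follows the scikit-learn estimator interface works, including a custom one. For example, to use a random forest, pass ensemble.RandomForestClassifier().
param_distributions: the parameter space to search, passed as a dictionary whose keys are parameter names defined by the estimator. Each value is either a list of candidates or a distribution object with an rvs method (for example from scipy.stats). For example, to tune max_depth, min_samples_split and n_estimators of a random forest, the space can be defined as follows: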

param_distribution = {"max_depth": [1, 2, 3, 4, 5],
                      "min_samples_split": [2, 3, 4, 5, 6],  # an integer value must be at least 2
                      "n_estimators": [10, 20, 30, 40, 50]}

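n_iter: the number of parameter settings that are sampled and evaluated (default 10).
scoring: the metric used to evaluate each candidate, given as a string or a callable. Several metrics can be passed as a list or a dictionary, in which case refit must name the metric used to select the best parameters.
n_jobs: the number of jobs to run in parallel; -1 uses all available processors.
cv: the cross-validation strategy. The default (None) is 5-fold cross-validation.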
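The parameters above can be combined as in the following sketch, which mixes list values with scipy.stats distributions and evaluates two metrics at once. The synthetic data from make_classification and the names X_demo, y_demo, param_distributions_demo and search_demo are assumptions made for illustration; they do not come from the original example.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Values may be lists (sampled uniformly) or scipy.stats distributions
param_distributions_demo = {
    "max_depth": randint(1, 6),           # integers 1 to 5 (upper bound is exclusive)
    "min_samples_split": randint(2, 11),  # integers 2 to 10
    "n_estimators": [10, 20, 30],
}

search_demo = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions_demo,
    n_iter=5,                                       # evaluate 5 sampled settings
    scoring={"auc": "roc_auc", "acc": "accuracy"},  # two metrics at once
    refit="auc",                                    # pick the best setting by AUC
    cv=3,
    random_state=0,                                 # repeatable sampling
)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
print(search_demo.cv_results_["mean_test_acc"])
```

With multi-metric scoring, cv_results_ holds one set of columns per metric name (here mean_test_auc and mean_test_acc), while best_params_ follows the metric named in refit.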

Next, we walk through two practical examples of RandomizedSearchCV.

Example 1: Logistic Regression on the Iris dataset

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

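# Create the classifier. The saga solver supports both the l1 and l2 penalties;
# the default lbfgs solver does not support l1.
model = LogisticRegression(solver='saga', max_iter=5000)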

# Define hyperparameters space
param_grid = {'C': np.logspace(-4, 4, 20),
              'penalty': ['l1', 'l2'],
              'fit_intercept': [True, False]}

# Instantiate RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=model, 
                                   param_distributions=param_grid, 
                                   n_iter=10, 
                                   scoring='accuracy', 
                                   n_jobs=-1, 
                                   cv=5)

# Fit the search on the data
random_search.fit(X, y)

# Print the best parameters found by the search
print(random_search.best_params_)

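Here we use the logistic regression classifier on the classic Iris dataset and run a randomized search over its hyperparameters. The search samples 10 parameter combinations (n_iter=10) from the following space: the penalty type (penalty, either L1 or L2), whether to fit an intercept (fit_intercept), and the inverse regularization strength C, which takes 20 values evenly spaced on a log scale between 1e-4 and 1e4.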
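Instead of a fixed list of log-spaced values, C can also be drawn from a continuous log-uniform distribution. The sketch below is one way to do that and to inspect the outcome; it assumes scipy.stats.loguniform is available, keeps only the default L2 penalty for simplicity, and uses illustrative names (X_iris, y_iris, search_c) that are not part of the original example.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_iris, y_iris = load_iris(return_X_y=True)

search_c = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    # C is sampled log-uniformly between 1e-4 and 1e4
    param_distributions={"C": loguniform(1e-4, 1e4)},
    n_iter=10,
    scoring="accuracy",
    cv=5,
    random_state=0,
)
search_c.fit(X_iris, y_iris)

# Best setting and its mean cross-validated accuracy
print(search_c.best_params_, search_c.best_score_)

# List every sampled C with its mean CV accuracy, best first
results = search_c.cv_results_
order = results["rank_test_score"].argsort()
for i in order:
    print(results["params"][i], results["mean_test_score"][i])
```

best_score_ is the mean cross-validated score of the best candidate, so it is an estimate from the same data the search was tuned on.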

Example 2: Random Forest on the Wine dataset

from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()
X, y = wine.data, wine.target

model = RandomForestClassifier()     # create the random forest classifier

# Define hyperparameters space
param_grid = {'n_estimators': [100, 200, 300, 400, 500],
              'max_depth': [1, 2, 3, 4, 5],
              'min_samples_split': [2, 5, 10, 15, 20],
              'min_samples_leaf': [1, 2, 5, 10],
              'max_features': ['sqrt', 'log2', None],  # 'auto' has been removed from recent scikit-learn releases
              'bootstrap': [True, False]}

# Instantiate RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=model, 
                                   param_distributions=param_grid, 
                                   n_iter=10, 
                                   scoring='accuracy', 
                                   n_jobs=-1, 
                                   cv=5)

# Fit the search on the data
random_search.fit(X, y)

# Print the best parameters found by the search
print(random_search.best_params_)

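In this example we use a random forest classifier and pick several hyperparameters as the search space. The search samples 10 combinations (n_iter=10) over the following parameters: n_estimators (the number of trees), max_depth (the maximum depth of each tree), min_samples_split (the minimum number of samples required to split a node), min_samples_leaf (the minimum number of samples required at a leaf node), max_features (the number of features considered when looking for the best split), and bootstrap (whether bootstrap samples are used when building the trees).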
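The example above fits the search on the full dataset. A common variation, sketched here, holds out a test set and scores the refitted best model on it. The 75/25 split, the smaller parameter space, and the names X_tr, X_te, y_tr, y_te, search_rf and test_accuracy are choices made for this illustration, not part of the original example.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_wine, y_wine = load_wine(return_X_y=True)

# Hold out 25% of the samples; the search never sees them
X_tr, X_te, y_tr, y_te = train_test_split(
    X_wine, y_wine, test_size=0.25, stratify=y_wine, random_state=0
)

search_rf = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100],
        "max_depth": [1, 2, 3, 4, 5],
        "min_samples_leaf": [1, 2, 5, 10],
    },
    n_iter=5,
    scoring="accuracy",
    cv=5,
    random_state=0,
)
search_rf.fit(X_tr, y_tr)

# With refit=True (the default), the best setting is retrained on all of X_tr,
# and score() evaluates that model with the scoring metric given above
test_accuracy = search_rf.score(X_te, y_te)
print(search_rf.best_params_)
print(test_accuracy)
```

The refitted model is also available as search_rf.best_estimator_ for later predictions.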
