A Detailed Guide to Scikit-learn's feature_selection.f_classif Function: Computing F-values and p-values

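sklearn.feature_selection.f_classif is a scoring function in Scikit-learn's feature selection module. For each feature it runs a one-way analysis of variance (ANOVA) F-test against the class labels and returns two arrays: the F-values and the corresponding p-values. It does not select features by itself; a selector such as SelectKBest uses these scores to rank the features and keep the highest-scoring ones, which reduces dimensionality and can improve model performance. The function is intended for classification problems, that is, for a discrete target variable. For a continuous target, the counterpart is f_regression.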
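To make the ANOVA step concrete, the following is a minimal sketch that computes the one-way F statistic by hand (between-class mean square divided by within-class mean square) and compares it with the output of f_classif on the Iris data. The helper name anova_f is hypothetical and exists only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif


def anova_f(X, y):
    """One-way ANOVA F statistic for every column of X (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)

    ss_between = np.zeros(X.shape[1])  # variation of class means around the grand mean
    ss_within = np.zeros(X.shape[1])   # variation of samples around their class mean
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        ss_between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)

    ms_between = ss_between / (n_classes - 1)
    ms_within = ss_within / (n_samples - n_classes)
    return ms_between / ms_within


X, y = load_iris(return_X_y=True)
F_manual = anova_f(X, y)
F_sklearn, p_sklearn = f_classif(X, y)

print("manual F :", F_manual)
print("sklearn F:", F_sklearn)
print("p-values :", p_sklearn)
```

A large F-value means the class means of that feature differ a lot relative to the spread inside each class, and the p-value is the probability of seeing an F-value at least that large if all class means were equal.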

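The typical way to use sklearn.feature_selection.f_classif is as follows:

Import the libraries and the dataset:

```
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
```

Load the dataset and split it into features and target variable:

```
iris = load_iris()
X = iris.data
y = iris.target
```

Use SelectKBest to keep the k features with the highest F-values:

```
selector = SelectKBest(f_classif, k=3)
X_new = selector.fit_transform(X, y)
```

In the code above, we keep the 3 features with the highest F-values as the new feature subset.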
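Since the function itself returns F-values and p-values, it can help to look at them directly. The sketch below calls f_classif on the Iris data, prints one score pair per feature, and checks which columns SelectKBest keeps. It assumes only the public attributes scores_ and pvalues_ and the method get_support() of SelectKBest.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# f_classif returns two arrays with one entry per feature
F, p = f_classif(X, y)
for name, f_val, p_val in zip(iris.feature_names, F, p):
    print(f"{name}: F={f_val:.2f}, p={p_val:.3g}")

# SelectKBest stores the same scores after fitting
selector = SelectKBest(f_classif, k=3).fit(X, y)
support = selector.get_support()  # boolean mask over the original columns
selected_names = [n for n, keep in zip(iris.feature_names, support) if keep]
print("kept features:", selected_names)
```

The mask returned by get_support() is the simplest way to map the columns of the reduced matrix back to the original feature names.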

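Model training:

```
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out 30% of the reduced data for testing
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.3, random_state=42)

# Fit a 3-nearest-neighbors classifier and predict on the test set
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```

In the code above, we train a K-nearest-neighbors classifier and use it to predict the test set.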

Evaluating classification performance:

```
from sklearn.metrics import accuracy_score, confusion_matrix

acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
```

In the code above, we compute the accuracy and the confusion matrix to evaluate the model.
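The walkthrough above fits SelectKBest on the full dataset before splitting, so the test labels influence which features are kept. One common way to avoid this is to split first and put the selector and the classifier into a single pipeline. The following is a sketch of that variant with the same parameters (k=3, n_neighbors=3, test_size=0.3, random_state=42); it is an alternative arrangement, not the procedure used above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Split before any feature selection so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The selector is fitted on the training data only, inside the pipeline
pipe = make_pipeline(
    SelectKBest(f_classif, k=3),
    KNeighborsClassifier(n_neighbors=3),
)
pipe.fit(X_train, y_train)

acc = pipe.score(X_test, y_test)
n_kept = int(pipe.named_steps["selectkbest"].get_support().sum())
print("Accuracy:", acc)
print("Features kept:", n_kept)
```

With a pipeline, the same object can also be passed to cross-validation utilities, and the selection is then repeated inside every fold.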

Two more feature selection examples follow:

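Example 1: using sklearn.feature_selection.f_classif to select the most predictive features of the credit card default dataset:

Import the libraries and the dataset:

```
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# The file is an Excel workbook, so it is read with read_excel (reading .xls needs the xlrd package).
# header=1 skips the first row and uses the second row as column names.
df = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls', header=1)
X = df.iloc[:,1:-1]  # drop the first column (ID) and the last column (target)
y = df.iloc[:,-1]    # the last column is the default label
```

Use SelectKBest to keep the k features with the highest F-values:

```
selector = SelectKBest(f_classif, k=5)
X_new = selector.fit_transform(X, y)
```

In the code above, we keep the 5 features with the highest F-values as the new feature subset.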
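When the features live in a DataFrame, it is usually more useful to know which columns were kept than to hold an unnamed array. The sketch below shows one way to recover the column names and a ranked score table. To stay runnable offline it uses a synthetic DataFrame from make_classification with made-up column names (feat_0 to feat_7) in place of the credit card data; these names and sizes are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a tabular classification dataset (hypothetical columns)
X_arr, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

selector = SelectKBest(f_classif, k=5).fit(X, y)

# Boolean mask -> names of the columns that were kept
selected_columns = X.columns[selector.get_support()]

# Full ranking of all features by F-value
ranking = pd.DataFrame(
    {"feature": X.columns, "F": selector.scores_, "p": selector.pvalues_}
).sort_values("F", ascending=False)

print(list(selected_columns))
print(ranking)
```

The same two lines (get_support() for the mask, scores_ and pvalues_ for the table) apply unchanged to the X and y loaded from the credit card file.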

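Example 2: selecting the most predictive features of the Boston housing dataset. The house price is a continuous target, so this is a regression problem, and the scoring function is f_regression rather than f_classif, which expects discrete class labels:

Import the libraries and the dataset:

```
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; requires an older version
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression  # continuous target, so not f_classif

boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
```

Use SelectKBest to keep the k highest-scoring features:

```
selector = SelectKBest(f_regression, k=3)
X_new = selector.fit_transform(X, y)
```

In the code above, we keep the 3 highest-scoring features as the new feature subset.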
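f_classif treats every distinct target value as a separate class, so for a continuous target such as a price the regression counterpart f_regression is the usual choice. In addition, load_boston was removed in scikit-learn 1.2. The sketch below shows the same selection step on a current installation, using the bundled load_diabetes dataset as a stand-in; the choice of dataset is an assumption for illustration and is not part of the original example.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# Bundled regression dataset: 442 samples, 10 features, continuous target
X, y = load_diabetes(return_X_y=True)

# f_regression scores each feature with a univariate linear F-test
F, p = f_regression(X, y)

selector = SelectKBest(f_regression, k=3)
X_new = selector.fit_transform(X, y)

print("F-values:", F)
print("kept column indices:", selector.get_support(indices=True))
print("reduced shape:", X_new.shape)
```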

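In short, sklearn.feature_selection.f_classif scores each feature with an ANOVA F-value and p-value. Combined with a selector such as SelectKBest, it helps us keep the most predictive feature subset for a classification task, which can improve the performance of a machine learning model.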
