How to Do Stepwise Regression in Python

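Stepwise regression is a variable-selection method for multiple linear regression. At each step it adds one independent variable to the model or removes one from it, then refits the model and recomputes the regression coefficients. The aim is a smaller model that keeps only the variables that contribute significantly. This article shows how to run a stepwise regression in Python.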

1. Prepare the data

We need a dataset with several independent variables and one dependent variable. A common approach is to read it into a DataFrame with the pandas library.

import pandas as pd

data = pd.read_csv('data.csv')
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

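The code above assumes the data file is named "data.csv", with the dependent variable in the last column and the independent variables in the columns before it. The independent variables are stored in X and the dependent variable in y.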
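If no data.csv is at hand, a small synthetic dataset with the same layout can stand in for it. The sketch below is only an illustration: the column names x1 to x5, the coefficients, and the random seed are assumptions and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv: 5 predictors, target in the last column.
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame(rng.normal(size=(n, 5)),
                        columns=['x1', 'x2', 'x3', 'x4', 'x5'])

# Only x1 and x3 drive the target; the other columns are pure noise.
target = 3.0 * features['x1'] - 2.0 * features['x3'] + rng.normal(size=n)
data = features.assign(y=target)

# Same split as in the listing above: the last column is the dependent variable.
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
print(X.shape, y.name)
```

Because only two of the five columns carry signal, a dataset like this makes it easy to check whether a selection procedure keeps the right variables.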

2. Build the model

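We can implement stepwise regression on top of the statsmodels library, using the p-values of OLS fits to decide which variables to add or drop.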

import pandas as pd
import statsmodels.api as sm


def stepwise_regression(X, y,
                        initial_list=None,
                        threshold_in=0.01,
                        threshold_out=0.05,
                        verbose=False):
    """Select columns of X by forward/backward stepwise OLS on p-values.

    threshold_in should be smaller than threshold_out; otherwise a
    variable could be added and dropped again indefinitely.
    """
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False

        # Forward step: fit one model per excluded column and record
        # the p-value of that column.
        excluded = [col for col in X.columns if col not in included]
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            new_pval[new_column] = model.pvalues[new_column]

        min_pval = new_pval.min()
        if min_pval < threshold_in:
            # idxmin returns the column label; argmin would return a position.
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True

            if verbose:
                print('Add {:30} with p-value {:.6}'.format(best_feature, min_pval))

        # Backward step: refit on the included columns and drop the
        # least significant one. Skipped while nothing is included.
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvalues = model.pvalues.iloc[1:]  # skip the intercept
            max_pval = pvalues.max()
            if max_pval > threshold_out:
                changed = True
                worst_feature = pvalues.idxmax()
                included.remove(worst_feature)

                if verbose:
                    print('Drop {:30} with p-value {:.6}'.format(worst_feature, max_pval))

        if not changed:
            break

    return included

result = stepwise_regression(X, y, verbose=True)

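The code above defines a stepwise_regression function that takes the independent variables, the dependent variable, and a few parameters: an initial list of variables, the p-value threshold for adding a variable (threshold_in), the threshold for dropping one (threshold_out), and a verbose flag. threshold_in should stay below threshold_out so that a variable is not added and dropped repeatedly. The function returns a list with the names of the selected independent variables.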

3. Analyze the results

print(result)

This prints the list of independent variables that were finally selected.

4. Example

The following complete example shows stepwise regression in Python from start to finish.

import pandas as pd
import statsmodels.api as sm

# Read the data (whitespace-delimited, no header row)
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data',
                   sep=r'\s+', header=None)
data.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Independent variables and dependent variable
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Stepwise regression
def stepwise_regression(X, y,
                        initial_list=None,
                        threshold_in=0.01,
                        threshold_out=0.05,
                        verbose=False):
    """Select columns of X by forward/backward stepwise OLS on p-values."""
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False

        # Forward step: record the p-value of each excluded column.
        excluded = [col for col in X.columns if col not in included]
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            new_pval[new_column] = model.pvalues[new_column]

        min_pval = new_pval.min()
        if min_pval < threshold_in:
            # idxmin returns the column label; argmin would return a position.
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True

            if verbose:
                print('Add {:30} with p-value {:.6}'.format(best_feature, min_pval))

        # Backward step: drop the least significant included column.
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvalues = model.pvalues.iloc[1:]  # skip the intercept
            max_pval = pvalues.max()
            if max_pval > threshold_out:
                changed = True
                worst_feature = pvalues.idxmax()
                included.remove(worst_feature)

                if verbose:
                    print('Drop {:30} with p-value {:.6}'.format(worst_feature, max_pval))

        if not changed:
            break

    return included

# Run the selection
result = stepwise_regression(X, y, verbose=True)

# Print the result
print(result)

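In this example, we read the Boston housing dataset from the UCI Machine Learning Repository and run the stepwise_regression function introduced above on it. The result is a list with the names of all selected independent variables.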

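This is only a simple example. Real problems may call for more preprocessing and data cleaning, as well as a more careful choice of candidate variables and thresholds, to get better results.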
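As one example of such preprocessing, the sketch below keeps only the numeric columns and drops rows with missing values before the data would be passed to stepwise_regression. The tiny DataFrame and its column names are made up for illustration; real data may call for imputation or categorical encoding instead.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value and a text column.
raw = pd.DataFrame({
    'area':  [50.0, 80.0, np.nan, 120.0],
    'rooms': [2, 3, 3, 5],
    'city':  ['A', 'B', 'A', 'C'],   # non-numeric column
    'price': [100.0, 160.0, 150.0, 250.0],
})

# Keep numeric columns only, then drop rows that contain missing values.
clean = raw.select_dtypes(include='number').dropna()

# Split into independent variables and the dependent variable.
X_clean = clean.drop(columns='price')
y_clean = clean['price']
print(X_clean.shape)
```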