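Computing studentized residuals in Python typically involves the following steps:
- Install statsmodels, a Python statistics library. The leading ! runs the command from a Jupyter notebook; drop it when running in a regular shell.
!pip install statsmodels
- Import the required libraries.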
import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import OLS
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import OLSInfluence
- Prepare the data. A simple example illustrates the process: the dataset below pairs exam scores with student heights, and 100 samples from it are used to compute the studentized residuals.
data = {'score': [58, 79, 66, 91, 87, 72, 82, 76, 92, 67, 55, 79, 85, 81, 80, 65, 82, 75, 70, 62, 88, 80, 77, 90, 73, 71, 82, 89, 94, 88, 100, 62, 86, 75, 84, 55, 76, 83, 74, 87, 86, 76, 71, 74, 83, 84, 82, 76, 63, 83, 85, 86, 94, 81, 89, 70, 71, 95, 69, 73, 74, 77, 57, 87, 80, 75, 94, 81, 89, 79, 85, 96, 78, 88, 59, 63, 71, 87, 91, 78, 63, 80, 97, 89, 82, 78, 80, 82, 77, 74, 99, 79, 85, 67, 82, 94, 59, 66, 91, 67, 91],
'height': [175, 184, 170, 185, 183, 171, 185, 181, 184, 169, 163, 179, 183, 181, 174, 170, 175, 178, 168, 174, 183, 181, 170, 181, 167, 168, 179, 184, 182, 184, 184, 165, 176, 172, 180, 161, 176, 173, 168, 179, 176, 172, 168, 175, 176, 182, 173, 169, 182, 167, 178, 179, 182, 187, 179, 180, 172, 167, 182, 166, 169, 171, 178, 162, 183, 179, 172, 189, 174, 185, 178, 182, 188, 175, 183, 167, 164, 172, 183, 190, 175, 166, 177, 188, 185, 181, 174, 178, 178, 175, 175, 190, 177, 177, 169, 174, 182, 166, 168, 181, 167, 187],
}
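# The two lists above differ in length, which pd.DataFrame rejects, so keep the first 100 entries of each
df = pd.DataFrame({name: values[:100] for name, values in data.items()})
df.head()
Output: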
score height
0 58 175
1 79 184
2 66 170
3 91 185
4 87 183
- Fit the regression model. add_constant adds the intercept column that OLS does not include by default.
X = add_constant(df['height'])
y = df['score']
model = OLS(y, X).fit()
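- Compute the studentized residuals. The resid_studentized_internal attribute gives the internally studentized residuals; OLSInfluence also exposes resid_studentized_external for the externally studentized (leave-one-out) version.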
influence = OLSInfluence(model)
studentized_residuals = influence.resid_studentized_internal
df['studentized_residuals'] = studentized_residuals
df.head()
Output (the studentized_residuals values shown are illustrative; an actual run may differ):
score height studentized_residuals
0 58 175 -0.573274
1 79 184 0.127950
2 66 170 -0.412853
3 91 185 1.443397
4 87 183 1.001688
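To make the quantity above concrete, here is a minimal sketch that computes internally studentized residuals by hand with NumPy, using r_i = e_i / (s * sqrt(1 - h_ii)), where e_i is the raw residual, s^2 is the residual variance estimate, and h_ii is the leverage. The function name internal_studentized_residuals and the synthetic height/score data are assumptions made for illustration and are not part of the example above; for a full-rank design matrix the result should agree with resid_studentized_internal.

```python
import numpy as np

def internal_studentized_residuals(X, y):
    """Internally studentized residuals for OLS of y on X.

    X must already contain the intercept column and have full column rank.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # OLS coefficients and raw residuals
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages h_ii: diagonal of the hat matrix, obtained from the reduced QR factor
    Q, _ = np.linalg.qr(X)
    leverage = np.sum(Q**2, axis=1)
    # Residual variance estimate with n - p degrees of freedom
    sigma2 = resid @ resid / (n - p)
    return resid / np.sqrt(sigma2 * (1.0 - leverage))

# Synthetic data, made up purely for illustration
rng = np.random.default_rng(0)
height = rng.normal(175, 7, size=50)
score = 20 + 0.35 * height + rng.normal(0, 8, size=50)
X = np.column_stack([np.ones_like(height), height])

t = internal_studentized_residuals(X, score)
print(t[:5])
```

The leverages come from the QR factor rather than from the full n-by-n hat matrix, which avoids building a large matrix when only its diagonal is needed.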
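Another example uses the Boston housing price dataset from sklearn. Two caveats apply: load_boston was removed in scikit-learn 1.2, so this code needs an older version, and OLSInfluence expects a fitted statsmodels results object rather than a scikit-learn estimator, so the model is fit with statsmodels OLS here.
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
# Load the dataset
boston = load_boston()
# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
df['target'] = boston['target']
# Fit the regression model with statsmodels so that OLSInfluence can use it
X = add_constant(df[boston.feature_names])
y = df['target']
model = OLS(y, X).fit()
# Compute the studentized residuals
influence = OLSInfluence(model)
studentized_residuals = influence.resid_studentized_internal
df['studentized_residuals'] = studentized_residuals
# Show the result
df.head()
Output (the studentized_residuals values shown are illustrative; an actual run may differ):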
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target studentized_residuals
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0 -0.413090
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6 -0.374930
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7 -0.819684
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4 -1.122952
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2 -0.541113
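With several predictors, as in this example, the externally studentized (leave-one-out) residual is often examined alongside the internal one. The sketch below derives it from the internal residual r_i through the identity t_i = r_i * sqrt((n - p - 1) / (n - p - r_i^2)), so no model refits are needed. The function name studentized_residuals and the synthetic three-predictor data are assumptions for illustration only; they are not the Boston data.

```python
import numpy as np

def studentized_residuals(X, y):
    """Return (internal, external) studentized residuals for OLS of y on X.

    X must already contain the intercept column and have full column rank.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages from the reduced QR factor of X
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    s2 = resid @ resid / (n - p)
    internal = resid / np.sqrt(s2 * (1.0 - h))
    # Leave-one-out version obtained from the internal one without refitting
    external = internal * np.sqrt((n - p - 1) / (n - p - internal**2))
    return internal, external

# Synthetic data with three predictors, made up purely for illustration
rng = np.random.default_rng(1)
Z = rng.normal(size=(60, 3))
X = np.column_stack([np.ones(60), Z])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=60)

internal, external = studentized_residuals(X, y)
# Rule-of-thumb screen for unusual points; the cutoff of 3 is a convention, not a fixed rule
suspects = np.flatnonzero(np.abs(external) > 3)
print(suspects)
```

The external version excludes observation i from the variance estimate, so a single large residual cannot inflate its own denominator.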
That is the complete process for computing studentized residuals in Python.