A Guide to Scikit-learn's datasets.fetch_covtype Function: Loading the Forest Cover Type Dataset

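fetch_covtype is a function in Scikit-learn's datasets module that loads the Covtype (forest cover type) dataset. Covtype is a large real-world cartographic dataset: each of its 581,012 samples describes a 30×30 m patch of forest through 54 features, and the task is to predict which of 7 cover types dominates the patch. The dataset is widely used in machine learning research and practice. fetch_covtype downloads the data on first use, caches it locally, and returns it ready for analysis.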

Basic usage of fetch_covtype looks like this:

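from sklearn.datasets import fetch_covtype

# Load the dataset (downloaded on first use, then read from the local cache)
covtype = fetch_covtype()

# Print the dataset description
print(covtype.DESCR)

# Get the feature matrix and the labels
X, y = covtype.data, covtype.target

# Check the shapes of the features and labels
print(X.shape)
print(y.shape)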

The function accepts several optional keyword arguments that control how the dataset is fetched and processed, for example:

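  • data_home: str or path-like (default=None), the local directory in which the dataset is cached. If not set, the data is stored in the default scikit-learn data directory (~/scikit_learn_data).
  • download_if_missing: bool (default=True), whether to download the dataset automatically. If True and the dataset is not yet cached locally, it is downloaded from the network. If False and the dataset is not cached, an OSError is raised.
  • shuffle: bool (default=False), whether to shuffle the samples. If True, the rows are randomly reordered, which avoids any dependence on the order in which the samples are stored. The seed is not passed through shuffle.
  • random_state: int, RandomState instance or None (default=None), the seed used when shuffle=True. Pass an integer to make the shuffled order reproducible.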
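Defaults can differ between scikit-learn releases, so it can be worth checking them against the installed version. The following is a minimal sketch that reads the defaults from the function signature with the standard inspect module. It does not download any data, and the choice of which parameters to print is only illustrative.

```python
import inspect

from sklearn.datasets import fetch_covtype

# Read the signature of fetch_covtype from the installed scikit-learn version.
# No data is downloaded; only the function object is inspected.
params = inspect.signature(fetch_covtype).parameters

# Collect the default value of each option of interest.
defaults = {
    name: params[name].default
    for name in ("data_home", "download_if_missing", "shuffle", "random_state")
}

for name, value in defaults.items():
    print(f"{name}: default={value!r}")
```

Any keyword that does not appear in params is not accepted by fetch_covtype and would raise a TypeError if passed.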

Two examples of fetch_covtype follow:

Example 1: Loading the Covtype dataset

from sklearn.datasets import fetch_covtype

# Load the dataset
covtype = fetch_covtype()

# Print the dataset description
print(covtype.DESCR)

# Get the feature matrix and the labels
X, y = covtype.data, covtype.target

# Check the shapes of the features and labels
print(X.shape)
print(y.shape)

Output:

(Text of covtype.DESCR, a description of the forest cover type dataset; its exact wording depends on the installed scikit-learn version.)

(581012, 54)
(581012,)

The output shows that the dataset contains 581,012 samples with 54 features each. The labels in y are integers from 1 to 7, one per cover type.
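To check which cover type labels are present and how often each occurs, np.unique can be applied to the label array. The sketch below wraps this in a small helper, class_distribution, which is a hypothetical name introduced only for illustration. A short hand-written label array stands in for covtype.target so that the snippet runs without downloading the dataset.

```python
import numpy as np


def class_distribution(y):
    """Return a dict mapping each label in y to its number of occurrences."""
    labels, counts = np.unique(y, return_counts=True)
    return {int(label): int(count) for label, count in zip(labels, counts)}


# Stand-in labels; with the real data, pass covtype.target instead.
y_demo = np.array([1, 2, 2, 7, 2, 1])
print(class_distribution(y_demo))  # {1: 2, 2: 3, 7: 1}
```

On the real labels the counts are far from equal, since some cover types are much more common than others.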

Example 2: Loading a downsampled Covtype dataset

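The code below shows one way to obtain a downsampled Covtype dataset. fetch_covtype has no parameter for sampling a fraction of the data, so the example shuffles the dataset with shuffle=True, fixes the seed with random_state so that the result is reproducible, and then keeps the first 5% of the samples.

from sklearn.datasets import fetch_covtype

# Load the full dataset, shuffled with a fixed seed for reproducibility
covtype = fetch_covtype(shuffle=True, random_state=42)

# Print the dataset description
print(covtype.DESCR)

# Keep the first 5% of the shuffled samples
n_samples = round(0.05 * covtype.data.shape[0])
X, y = covtype.data[:n_samples], covtype.target[:n_samples]

# Check the shapes of the features and labels
print(X.shape)
print(y.shape)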

Output:

(Text of covtype.DESCR, a description of the forest cover type dataset; its exact wording depends on the installed scikit-learn version.)

(29051, 54)
(29051,)

Compared with the previous example, the dataset has been reduced to 5% of its original size: 29,051 samples, still with 54 features.
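When downsampling, the class proportions of the sample can drift from those of the full dataset, because some cover types are much rarer than others. One alternative, offered here as an assumption rather than something the example above does, is a stratified sample drawn with train_test_split from sklearn.model_selection. Synthetic arrays with 54 columns and labels 1 to 7 stand in for the real data so that the snippet runs without a download; the class sizes are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 samples, 54 features,
# labels 1..7 with deliberately unequal (made-up) class sizes.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 54))
y_demo = np.repeat(np.arange(1, 8), [400, 400, 60, 20, 40, 40, 40])

# Keep 5% of the samples; stratify=y_demo preserves the class proportions,
# and random_state makes the selection reproducible.
X_small, _, y_small, _ = train_test_split(
    X_demo, y_demo, train_size=0.05, stratify=y_demo, random_state=42
)

print(X_small.shape)
print(y_small.shape)
```

With the real dataset, X_demo and y_demo would be replaced by covtype.data and covtype.target.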