Python 3 Web Scraping: A Detailed Guide to BeautifulSoup

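1. Introduction

BeautifulSoup is a Python library for parsing HTML and XML documents. It builds a parse tree even from malformed markup, so a badly formed HTML document can be written back out as well-formed HTML or XML. Besides the HTML parser in Python's standard library, it also supports the lxml HTML parser, the lxml XML parser, and the html5lib parser.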
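To make the repair behaviour concrete, here is a minimal sketch that feeds a deliberately broken fragment to the standard-library parser. The fragment and the variable names are made up for illustration; the lxml and html5lib parsers are separate packages and may repair the same input slightly differently.

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup: neither <b> nor <p> is ever closed.
broken_html = "<p>Hello, <b>world"

# "html.parser" selects the parser bundled with Python's standard library.
soup = BeautifulSoup(broken_html, "html.parser")

# Serialising the tree back to text closes the tags that were left open.
repaired = str(soup)
print(repaired)

# The parsed tree can be navigated like ordinary Python objects.
bold_text = soup.b.get_text()
paragraph_text = soup.p.get_text()
print(bold_text)
print(paragraph_text)
```

Swapping the second argument for 'lxml' or 'html5lib' would select one of the other parsers, provided the corresponding package is installed.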

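2. Installation

Install it with pip (the examples below also use the requests library, which can be installed the same way):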

pip install beautifulsoup4

3. Examples

3.1 Scraping a home page

The following example scrapes the article titles and links from my blog's home page:

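import requests
from bs4 import BeautifulSoup

url = 'https://www.yingvickycao.com/'

# Fetch the home page; fail fast on network stalls and HTTP errors.
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Each article title is an <h2 class="entry-title"> that wraps a link.
articles = soup.find_all('h2', class_='entry-title')

for article in articles:
    if article.a is None:
        continue  # skip headings that contain no link
    title = article.a.text.strip()
    link = article.a.get('href')
    print(title, link)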

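First, we use the requests library to send an HTTP request and fetch the HTML of the blog's home page. We then parse that HTML with BeautifulSoup, which turns it into objects that Python code can navigate. Next, we call find_all to locate every h2 tag whose class is entry-title, and use strip to remove whitespace from both ends of each title string. Finally, we print each article's title and link.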
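The extraction step described above can be tried without any network access. The sketch below runs the same find_all query against a small hard-coded page; the sample markup, the extract_articles helper, and the link paths are all hypothetical stand-ins, not the real blog's HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded blog home page.
SAMPLE_HTML = """
<html><body>
  <h2 class="entry-title"><a href="/posts/first">  First post </a></h2>
  <h2 class="entry-title"><a href="/posts/second">Second post</a></h2>
  <h2 class="entry-title">Heading without a link</h2>
  <h2 class="other">Not an article</h2>
</body></html>
"""


def extract_articles(html):
    """Return (title, link) pairs for each h2.entry-title that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for heading in soup.find_all("h2", class_="entry-title"):
        # Only accept <a> tags that actually carry an href attribute.
        anchor = heading.find("a", href=True)
        if anchor is None:
            continue
        # get_text(strip=True) trims surrounding whitespace from the title.
        results.append((anchor.get_text(strip=True), anchor["href"]))
    return results


articles = extract_articles(SAMPLE_HTML)
for title, link in articles:
    print(title, link)
```

Keeping the parsing in a function that takes an HTML string makes it easy to check the selector against saved pages before pointing it at a live site.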

3.2 Scraping images

The following example downloads every image found on a web page:

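import os
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

url = 'https://www.example.com/'

response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

images = soup.find_all('img')

for image in images:
    src = image.get('src')
    if not src:
        continue  # some <img> tags have no src attribute
    # src is often a relative path, so resolve it against the page URL.
    image_url = urljoin(url, src)
    # Name the file after the URL path, ignoring any query string.
    filename = os.path.basename(urlparse(image_url).path)
    # Skip inline data: URIs and URLs that yield no usable file name.
    if not filename or not image_url.startswith(('http://', 'https://')):
        continue
    image_response = requests.get(image_url, timeout=10)
    if not image_response.ok:
        continue  # skip images that fail to download
    with open(filename, 'wb') as f:
        f.write(image_response.content)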

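First, we use the requests library to send an HTTP request and fetch the site's HTML. We then parse that HTML with BeautifulSoup, which turns it into objects that Python code can navigate. Next, we call find_all to locate every img tag and read the link stored in each tag's src attribute. Finally, we request each image file with requests and save it locally using the open function.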
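One detail worth checking before downloading anything is what the src values look like, since they are frequently relative paths or carry query strings. The sketch below, which needs no network access, shows one way to turn them into absolute URLs and local file names; the sample markup, the base URL path, and the collect_image_targets helper are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

# Hypothetical markup mixing root-relative, relative, and absolute image links.
SAMPLE_HTML = """
<img src="/static/logo.png">
<img src="photos/cat.jpg?size=large">
<img src="https://cdn.example.com/banner.gif">
<img alt="tag without a src attribute">
"""


def collect_image_targets(html, base_url):
    """Return (absolute_url, filename) pairs for every <img> that has a src."""
    soup = BeautifulSoup(html, "html.parser")
    targets = []
    for image in soup.find_all("img"):
        src = image.get("src")
        if not src:
            continue  # nothing to download for this tag
        # Resolve relative paths against the URL of the page they came from.
        absolute_url = urljoin(base_url, src)
        # Take the last path segment as the file name, dropping any query string.
        filename = posixpath.basename(urlparse(absolute_url).path)
        if filename:
            targets.append((absolute_url, filename))
    return targets


targets = collect_image_targets(SAMPLE_HTML, "https://www.example.com/gallery/")
for absolute_url, filename in targets:
    print(filename, absolute_url)
```

Each absolute URL in the result could then be passed to requests.get and the response bytes written to the matching file name.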