Python HTML Parser BeautifulSoup Usage Examples [Web-Scraping Parser]



Introduction

BeautifulSoup is a Python HTML/XML parsing library for extracting data from HTML or XML documents. It parses a document into a tree of Python objects and supports searching that tree with methods such as `find_all()` as well as with CSS selectors via `select()`, making data extraction very convenient. (BeautifulSoup itself does not evaluate XPath expressions; for XPath, use lxml directly.)

BeautifulSoup works with the HTML parser in Python's standard library and also with third-party parsers such as lxml and html5lib. It also detects and converts document encodings automatically while parsing, so it copes well with HTML documents in a variety of encodings.

Installation

Install BeautifulSoup with pip, along with the optional lxml and html5lib parsers:

pip install beautifulsoup4
pip install lxml
pip install html5lib
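
Once installed, the parser is selected by the second argument to the `BeautifulSoup` constructor. A minimal sketch of the difference (the commented lines assume lxml and html5lib were installed as above; the parsers differ in speed and in how they repair invalid markup):

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: unclosed <p> and <b> tags
broken = '<p>Unclosed paragraph<b>bold'

# Standard-library parser: no extra dependency, reasonably lenient;
# BeautifulSoup closes the open tags when serializing the repaired tree
print(BeautifulSoup(broken, 'html.parser'))

# Third-party parsers (uncomment if installed):
# print(BeautifulSoup(broken, 'lxml'))      # very fast, lenient
# print(BeautifulSoup(broken, 'html5lib'))  # slowest, browser-grade repair
```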

Basic Usage

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Pretty-print the parsed document
print(soup.prettify())

# Get the <title> tag
print(soup.title)

# Get the title tag's name
print(soup.title.name)

# Get the text inside <title>
print(soup.title.string)

# Get the first <p> tag
print(soup.p)

# Get the href attribute of the first <a> tag
print(soup.a['href'])

# Get all <a> tags
print(soup.find_all('a'))

# Get all <a> tags whose class is "sister"
print(soup.find_all('a', {'class': 'sister'}))

# Get the <a> tag whose id is "link1"
print(soup.find_all('a', {'id': 'link1'}))

Running this prints:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

<title>The Dormouse's story</title>
title
The Dormouse's story
<p class="title"><b>The Dormouse's story</b></p>
http://example.com/elsie
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
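
As noted in the introduction, searching can also be done with CSS selectors through the `select()` method, which returns a list of matching tags. A minimal self-contained sketch using a small fragment that mirrors the document above:

```python
from bs4 import BeautifulSoup

# A small fragment mirroring the document above
html_doc = ('<p class="story">'
            '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
            '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
            '</p>')
soup = BeautifulSoup(html_doc, 'html.parser')

# Class selector: all <a> tags with class "sister"
print(soup.select('a.sister'))

# Id selector: the tag with id "link1"
print(soup.select('#link1'))

# Descendant selector: <a> tags inside <p class="story">
print(soup.select('p.story a'))
```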

Example 1: Extract all image URLs from a page

from bs4 import BeautifulSoup
import requests

url = 'http://www.nationalgeographic.com.cn/animals/'

# Fetch the page content
html = requests.get(url).content

# Parse into a document tree
soup = BeautifulSoup(html, 'html.parser')

# Find all <img> tags
img_list = soup.find_all('img')

# Print each image URL; use .get() so an <img> without a src
# attribute yields None instead of raising KeyError
for img in img_list:
    print(img.get('src'))
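
One caveat the loop above does not handle: `src` values are often relative URLs. The standard library's `urllib.parse.urljoin` resolves them against the page URL. A sketch with hypothetical `src` values as they might appear in the HTML:

```python
from urllib.parse import urljoin

page_url = 'http://www.nationalgeographic.com.cn/animals/'  # the page being scraped

# Hypothetical src values: root-relative, page-relative, and absolute
for src in ['/images/lion.jpg', 'tiger.png', 'http://cdn.example.com/bear.jpg']:
    # urljoin leaves absolute URLs untouched and resolves relative ones
    print(urljoin(page_url, src))
```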

Example 2: Extract all jokes and their authors from a Qiushibaike page

from bs4 import BeautifulSoup
import requests

url = 'https://www.qiushibaike.com/'

# Fetch the page content
html = requests.get(url).content

# Parse into a document tree
soup = BeautifulSoup(html, 'html.parser')

# Collect every joke and its author
author_list = []
content_list = []
for article in soup.find_all('div', {'class': 'article'}):
    author_tag = article.find('span', {'class': 'author'})
    content_tag = article.find('div', {'class': 'content'})
    if author_tag is None or content_tag is None:
        continue  # skip entries missing either field
    author_list.append(author_tag.get_text(strip=True))
    content_list.append(content_tag.get_text(strip=True))

# Print every joke with its author
for author, content in zip(author_list, content_list):
    print('Author: %s, Joke: %s' % (author, content))
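
Besides `find()`/`find_all()`, the parsed tree can be walked directly through navigation attributes such as `.parent`, `.children`, and `.next_sibling`. A minimal self-contained sketch (note that in real documents, whitespace between tags often makes `.next_sibling` a text node rather than a tag):

```python
from bs4 import BeautifulSoup

# A small fragment with no whitespace between tags, so siblings are tags
html_doc = '<p class="story"><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, 'html.parser')

link1 = soup.find('a', id='link1')
print(link1.parent.name)         # name of the enclosing <p> tag
print(link1.next_sibling['id'])  # id of the following sibling <a> tag
print([c.string for c in soup.p.children])  # text of each direct child
```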

Summary

This article covered the basics of BeautifulSoup, a Python HTML parser: installation, parsing HTML, and traversing the resulting tree. The two examples showed how BeautifulSoup is applied in practical web-scraping work; these fundamentals are essential skills for crawler development.