Detailed Examples of Using BeautifulSoup, Python's HTML Parser [Crawler Parsers]
Introduction
BeautifulSoup is a Python library for parsing HTML and XML that extracts information from HTML or XML documents. It parses a document into a tree of Python objects and provides very convenient data-extraction methods, including CSS selector support via select(). (Note that BeautifulSoup itself does not evaluate XPath expressions; for XPath you would use a library such as lxml directly.)
BeautifulSoup works with the HTML parser in the Python standard library and also supports third-party parsers such as lxml and html5lib. In addition, it automatically detects document encodings and converts input to Unicode, so it copes well with HTML documents in a wide range of encodings.
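The parser is chosen by the second argument to the BeautifulSoup constructor; a minimal sketch:

```python
from bs4 import BeautifulSoup

html = "<p>Hello</p>"
# 'html.parser' is the built-in parser; pass 'lxml' or 'html5lib'
# here instead if those third-party packages are installed.
soup = BeautifulSoup(html, "html.parser")
print(soup.p.string)
```

Different parsers can build slightly different trees for malformed HTML, so it is worth fixing the parser explicitly rather than relying on the default.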
Installation
Install BeautifulSoup with pip (the lxml and html5lib parsers are optional):
pip install beautifulsoup4
pip install lxml
pip install html5lib
Basic Usage
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Pretty-print the document
print(soup.prettify())
# Get the <title> tag
print(soup.title)
# Get the name of the title tag
print(soup.title.name)
# Get the text contained in <title>
print(soup.title.string)
# Get the first <p> tag
print(soup.p)
# Get the href attribute of the first <a> tag
print(soup.a['href'])
# Get all <a> tags
print(soup.find_all('a'))
# Get all <a> tags whose class is "sister"
print(soup.find_all('a', {'class': 'sister'}))
# Get the <a> tag whose id is "link1"
print(soup.find_all('a', {'id': 'link1'}))
The output is:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
<title>The Dormouse's story</title>
title
The Dormouse's story
<p class="title"><b>The Dormouse's story</b></p>
http://example.com/elsie
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
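Besides find_all(), the CSS selectors mentioned in the introduction are available through select(); a short sketch on a fragment of the same document:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# select() takes a CSS selector string and returns a list of matching tags
print(len(soup.select("a.sister")))     # tags with class "sister"
print(soup.select("#link1")[0].string)  # the tag whose id is "link1"
```

select() accepts most common CSS selectors (tag, .class, #id, descendant combinators), which is often more concise than nested find_all() calls.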
Example 1: Extract all image URLs from a page
from bs4 import BeautifulSoup
import requests
url = 'http://www.nationalgeographic.com.cn/animals/'
# Fetch the page content
html = requests.get(url).content
# Parse it into a document tree
soup = BeautifulSoup(html, 'html.parser')
# Find all <img> tags
img_list = soup.find_all('img')
# Print each image URL
for img in img_list:
    print(img['src'])
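In practice some <img> tags carry relative paths or no src attribute at all, in which case img['src'] raises a KeyError. A hedged variant (the HTML literal below is a made-up stand-in for a fetched page):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Stand-in markup; a real page would come from requests.get(url).content
base = "http://www.nationalgeographic.com.cn/animals/"
html = ('<img src="/img/a.jpg">'
        '<img alt="no src">'
        '<img src="http://cdn.example.com/b.png">')
soup = BeautifulSoup(html, "html.parser")
# img.get('src') returns None instead of raising when src is missing;
# urljoin resolves relative paths against the page URL.
urls = [urljoin(base, img["src"])
        for img in soup.find_all("img") if img.get("src")]
print(urls)
```

The same pattern applies to href attributes when collecting links.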
Example 2: Extract all jokes and their authors from a Qiushibaike page
from bs4 import BeautifulSoup
import requests
url = 'https://www.qiushibaike.com/'
# Fetch the page content
html = requests.get(url).content
# Parse it into a document tree
soup = BeautifulSoup(html, 'html.parser')
# Collect every joke and its author
author_list = []
content_list = []
for article in soup.find_all('div', {'class': 'article'}):
    author = article.find('span', {'class': 'author'}).get_text(strip=True)
    content = article.find('div', {'class': 'content'}).get_text(strip=True)
    author_list.append(author)
    content_list.append(content)
# Print every joke with its author
for i in range(len(author_list)):
    print('Author: %s, Joke: %s' % (author_list[i], content_list[i]))
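Live sites change their markup over time (and may require request headers or go offline), so here is the same extraction logic run against a made-up stand-in document that mimics the structure the example assumes:

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the page the example targets;
# the real site's structure may differ.
html = """
<div class="article"><span class="author">Alice</span>
<div class="content">A short joke.</div></div>
<div class="article"><span class="author">Bob</span>
<div class="content">Another joke.</div></div>
"""
soup = BeautifulSoup(html, "html.parser")
pairs = []
for article in soup.find_all("div", {"class": "article"}):
    author = article.find("span", {"class": "author"}).get_text(strip=True)
    content = article.find("div", {"class": "content"}).get_text(strip=True)
    pairs.append((author, content))
for author, content in pairs:
    print("Author: %s, Joke: %s" % (author, content))
```

Testing the parsing logic against a fixed local snippet like this makes the scraper easy to verify before pointing it at the live site.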
Summary
This article covered the basics of BeautifulSoup, Python's HTML parser, including installation, parsing HTML, and traversing the document tree. The two examples then showed how BeautifulSoup is applied in real crawler development; these fundamentals are essential skills for web scraping work.
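Tree traversal, mentioned above, works through navigation attributes such as .parent, .children, and .next_sibling; a minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>one</p><p>two</p></body>", "html.parser")
first = soup.p
print(first.parent.name)        # the enclosing tag
print(first.next_sibling.string)  # the tag that follows at the same level
# .children iterates over a tag's direct child nodes
print([child.name for child in soup.body.children])
```

These attributes complement find_all() and select() when you need to move relative to a tag you have already located.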