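Python Example: Scraping Content and Saving It to Excel

This tutorial walks through a Python example that scrapes content from a web page and saves it to an Excel file.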

Main Text

Environment Setup

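To follow along, you need Python plus the requests, beautifulsoup4, and pandas packages, all of which are imported by the code below. pandas also needs the openpyxl package to write .xlsx files. All of them can be installed with pip.

Run the following command on the command line to install them:

pip install requests beautifulsoup4 pandas openpyxl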
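
Before running the scripts, it can help to confirm that the packages are importable. The following is a minimal sketch and not part of the original tutorial: the helper name `missing_packages` is hypothetical, and the module list assumes the imports used in this tutorial plus openpyxl, which pandas uses to write .xlsx files.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names used by this tutorial (bs4 is installed as beautifulsoup4).
required = ["requests", "bs4", "pandas", "openpyxl"]
missing = missing_packages(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If the script reports a missing module, installing the corresponding package with pip and running the check again should resolve it.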

Scraping the Content

Now let's write a simple Python program that scrapes data from a website.

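import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace it with the page you want to scrape.
url = 'https://www.example.com/'
# Download the page and parse the raw HTML.
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

# Collect the text of every <h2> element on the page.
title_list = []
for title in soup.find_all('h2'):
    title_list.append(title.get_text())

print(title_list)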

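In this example, we first use the requests library to fetch the HTML of the given page. We then use BeautifulSoup to parse the HTML and pull out the data we need: here, the text of every h2 tag, collected into a list. Finally, we print the list to check the result. If the page contains no h2 tags, the list is simply empty.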
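
The fetch and parse steps described above can also be split into two small functions, which makes the parsing step easy to try without network access. This is only a sketch under stated assumptions: the function names `fetch_html` and `extract_titles`, the 10-second timeout, and the inline sample markup are all illustrative and do not come from the original example.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes; raise on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.content


def extract_titles(html, tag="h2"):
    """Return the stripped text of every `tag` element in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [element.get_text(strip=True) for element in soup.find_all(tag)]


# Illustrative markup (made up for this sketch), so no network is needed.
sample_html = """
<html><body>
  <h2> First post </h2>
  <p>Some text</p>
  <h2>Second post</h2>
</body></html>
"""

print(extract_titles(sample_html))  # ['First post', 'Second post']
```

For a real page, `extract_titles(fetch_html(url))` chains the two steps. Setting a timeout and checking the status code are optional, but they keep a slow or failing request from silently producing an empty list.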

Writing to Excel

With the data scraped, the next step is to save it to an Excel file. The pandas library can do this for us.

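import pandas as pd

# Put the scraped titles into a single-column DataFrame.
df = pd.DataFrame({'Title': title_list})
# Write the DataFrame to an Excel file, leaving out the index column.
df.to_excel('titles.xlsx', index=False)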

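In this example, we use pandas to create a DataFrame with the list of titles as its only column, then save it to an Excel file named "titles.xlsx". Note that we pass index=False so that the Excel file does not contain the DataFrame's index column.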
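
Scraped lists often contain blank entries or duplicates, so it can be worth tidying the data before writing it. The sketch below is one possible way to do that. The helper names `build_frame` and `save_frame`, the sheet name, and the sample titles are assumptions made for illustration.

```python
import pandas as pd


def build_frame(titles):
    """Build a single-column DataFrame of unique, non-empty titles."""
    cleaned = [t.strip() for t in titles if t and t.strip()]
    frame = pd.DataFrame({"Title": cleaned})
    # Drop repeated titles and renumber the rows from zero.
    return frame.drop_duplicates(ignore_index=True)


def save_frame(frame, path, sheet_name="Titles"):
    """Write the frame to an .xlsx file without the index column."""
    # pandas relies on the openpyxl package to write .xlsx files.
    frame.to_excel(path, index=False, sheet_name=sheet_name)


# Sample data made up for this sketch.
df = build_frame(["First post", " Second post ", "", "First post"])
print(df)
# save_frame(df, "titles.xlsx")  # uncomment to write the file
```

The `sheet_name` argument only changes the name of the worksheet tab. The file contents are otherwise the same as in the example above.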

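Complete Code

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Placeholder URL; replace it with the page you want to scrape.
url = 'https://www.example.com/'
# Download the page and parse the raw HTML.
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

# Collect the text of every <h2> element on the page.
title_list = []
for title in soup.find_all('h2'):
    title_list.append(title.get_text())

# Save the titles to an Excel file without the index column.
df = pd.DataFrame({'Title': title_list})
df.to_excel('titles.xlsx', index=False)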

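Example Notes

The script can be adapted to other websites by changing the URL and the tag to extract. The following example extracts the project names from GitHub Trending.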

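import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://github.com/trending'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

# Find the headings that hold the repository names. GitHub's markup can
# change over time, so adjust the tag and class if nothing is found.
repo_list = []
for repo in soup.find_all('h1', {'class': 'h3 lh-condensed'}):
    # strip=True trims whitespace around each text fragment.
    repo_list.append(repo.get_text(strip=True).replace('\n', ' '))

# Save the repository names to an Excel file without the index column.
df = pd.DataFrame({'Repository': repo_list})
df.to_excel('github_trending.xlsx', index=False)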

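In this example, we use the class name 'h3 lh-condensed' to find the h1 tags that contain the project names, and pass strip=True to get_text() to trim spaces and newlines around each piece of text. We then use replace() to turn any remaining newline into a space. Finally, we save the extracted data to an Excel file named "github_trending.xlsx". GitHub may change its page markup, so if the list comes back empty, inspect the page and adjust the tag name and class.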
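
To make the tag and class easier to swap per site, the extraction step could also take a CSS selector. The sketch below uses BeautifulSoup's select() method for this instead of the find_all() call in the example above. The function name `extract_texts` is hypothetical, and the sample markup is made up to imitate a repository heading rather than copied from GitHub.

```python
from bs4 import BeautifulSoup


def extract_texts(html, css_selector):
    """Return whitespace-normalized text for each element matching the selector."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for element in soup.select(css_selector):
        # Join text fragments with a space, then collapse runs of whitespace.
        raw = element.get_text(" ", strip=True)
        texts.append(" ".join(raw.split()))
    return texts


# Made-up markup that imitates a repository heading; not copied from GitHub.
sample_html = """
<article>
  <h1 class="h3 lh-condensed">
    <a href="/octo/demo"><span>octo /</span>
      demo</a>
  </h1>
</article>
"""

print(extract_texts(sample_html, "h1.h3.lh-condensed"))  # ['octo / demo']
```

Passing a separator to get_text() keeps text from nested tags from running together, and a selector such as "h1.h3.lh-condensed" matches the classes regardless of the order in which they appear in the attribute.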

Hopefully these examples are helpful.
