Creating a pandas DataFrame from nested XML can be broken into the following steps:
- Import the required libraries
import xml.etree.ElementTree as ET
import pandas as pd
- Parse the XML file
tree = ET.parse('file.xml')
root = tree.getroot()
Here we assume the XML data looks like this:
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
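To try the steps without a file.xml on disk, the same kind of document can be parsed from a string. The sketch below is a shortened copy of the sample data; the variable names xml_text, root_tag and country_names are only illustrative. Note that ET.fromstring() returns the root element directly, so no getroot() call is needed.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document, held in memory instead of a file
xml_text = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""

# fromstring() parses the text and returns the root Element itself
root = ET.fromstring(xml_text)

root_tag = root.tag
country_names = [country.get('name') for country in root.findall('country')]
print(root_tag, country_names)
```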
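- Extract the data the DataFrame needs
Depending on the structure of the XML, we can use either iter() or findall(). findall() matches only the direct children of the element it is called on, so it suits a tag that repeats at one known level. For example, to collect the data for every country: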
data = []
for country in root.findall('country'):
    # Child elements hold their values as text; 'name' is an attribute
    rank = country.find('rank').text
    name = country.get('name')
    year = country.find('year').text
    gdppc = country.find('gdppc').text
    # Collapse the nested <neighbor> elements into a list of names
    neighbors = [neighbor.get('name') for neighbor in country.findall('neighbor')]
    data.append((rank, name, year, gdppc, neighbors))
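The loop above assumes every country has rank, year and gdppc children; find() returns None for a missing child, and reading .text from None raises AttributeError. One possible way to tolerate gaps is findtext(), which returns None (or a supplied default) instead. In this sketch the second country, "Atlantis", is a made-up entry with no year element, added only to show the behavior.

```python
import xml.etree.ElementTree as ET

# "Atlantis" is hypothetical and deliberately lacks a <year> element
xml_text = """<data>
    <country name="Liechtenstein"><rank>1</rank><year>2008</year></country>
    <country name="Atlantis"><rank>99</rank></country>
</data>"""
root = ET.fromstring(xml_text)

records = []
for country in root.findall('country'):
    records.append((
        country.get('name'),
        country.findtext('rank'),   # text of the first <rank> child, or None
        country.findtext('year'),   # None when the child is missing
    ))

print(records)
```

When these records are passed to pd.DataFrame, the None entries show up as missing values rather than stopping the parse.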
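iter() walks the entire subtree under the element it is called on, so it finds matching elements at any depth. For example, to read the data for the first country only:
for child in root.iter('country'):
    rank = child.find('rank').text
    name = child.get('name')
    year = child.find('year').text
    gdppc = child.find('gdppc').text
    neighbors = [neighbor.get('name') for neighbor in child.findall('neighbor')]
    break  # stop after the first country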
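The practical difference between the two methods shows up with elements that sit more than one level below the root. A minimal sketch, using a one-country excerpt of the sample data (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

xml_text = """<data>
    <country name="Panama">
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""
root = ET.fromstring(xml_text)

# findall('neighbor') looks only at direct children of <data>: no match
direct = root.findall('neighbor')

# iter('neighbor') searches the whole subtree, at any depth
names_iter = [n.get('name') for n in root.iter('neighbor')]

# findall() can also search descendants when given the './/' path prefix
names_path = [n.get('name') for n in root.findall('.//neighbor')]

print(len(direct), names_iter, names_path)
```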
- Convert the data into a DataFrame
df = pd.DataFrame(data, columns=['Rank', 'Name', 'Year', 'GDP Per Capita', 'Neighbors'])
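Because .text always returns strings, the Rank, Year and GDP Per Capita columns built this way hold text rather than numbers. If numeric columns are wanted, one option is pd.to_numeric. The sketch below uses two rows hand-copied from the sample data in place of the parsed result; numeric_cols is an illustrative name.

```python
import pandas as pd

# Two rows as the extraction loop would produce them (all values are strings)
data = [
    ('1', 'Liechtenstein', '2008', '141100', ['Austria', 'Switzerland']),
    ('4', 'Singapore', '2011', '59900', ['Malaysia']),
]
df = pd.DataFrame(data, columns=['Rank', 'Name', 'Year', 'GDP Per Capita', 'Neighbors'])

# Convert the text columns that actually hold numbers
numeric_cols = ['Rank', 'Year', 'GDP Per Capita']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df.dtypes)
```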
The complete code snippet:
import xml.etree.ElementTree as ET
import pandas as pd

# Parse the file and get the root <data> element
tree = ET.parse('file.xml')
root = tree.getroot()

# Build one tuple per <country> element
data = []
for country in root.findall('country'):
    rank = country.find('rank').text
    name = country.get('name')
    year = country.find('year').text
    gdppc = country.find('gdppc').text
    neighbors = [neighbor.get('name') for neighbor in country.findall('neighbor')]
    data.append((rank, name, year, gdppc, neighbors))

df = pd.DataFrame(data, columns=['Rank', 'Name', 'Year', 'GDP Per Capita', 'Neighbors'])
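The snippet above keeps each country's neighbors as a Python list inside a single cell. If one row per country-neighbor pair is more convenient, DataFrame.explode is one way to flatten that column. This is a sketch on a shortened, in-memory copy of the sample data; rows and long_df are illustrative names.

```python
import xml.etree.ElementTree as ET
import pandas as pd

xml_text = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""
root = ET.fromstring(xml_text)

rows = []
for country in root.findall('country'):
    rows.append({
        'Name': country.get('name'),
        'Rank': int(country.find('rank').text),
        'Neighbors': [n.get('name') for n in country.findall('neighbor')],
    })
df = pd.DataFrame(rows)

# explode() repeats the other columns once per element of each list
long_df = df.explode('Neighbors').reset_index(drop=True)
print(long_df)
```

The list form is compact for display, while the exploded form is usually easier to filter, group or join on the neighbor name.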