How to Create a Pandas DataFrame from Nested XML

Creating a Pandas DataFrame from nested XML can be broken into the following steps:

Import the required libraries

import xml.etree.ElementTree as ET
import pandas as pd

Parse the XML file

tree = ET.parse('file.xml')
root = tree.getroot()

Here we assume the XML data is as follows:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
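
When the XML is already held in a string rather than in a file (for example, after fetching it over a network), ET.fromstring() can be used instead of ET.parse(). It returns the root element directly, so no getroot() call is needed. The following is a minimal sketch that assumes a shortened copy of the sample data stored in a variable named xml_text; that name is illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory copy of part of the sample document.
xml_text = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""

# fromstring() parses the text and returns the root Element itself.
root = ET.fromstring(xml_text)

# Names of the <country> elements directly under the root.
names = [country.get("name") for country in root.findall("country")]
print(root.tag, names)
```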

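Extract the data for the DataFrame

We can use either iter() or findall() to collect the data; which one fits depends on the structure of the XML. findall(tag) returns only the matching direct children of the element it is called on, whereas iter(tag) walks the entire subtree. Because every country element here is a direct child of the root, findall() is enough to collect all the countries: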

data = []
for country in root.findall('country'):
    rank = country.find('rank').text
    name = country.get('name')
    year = country.find('year').text
    gdppc = country.find('gdppc').text
    neighbors = [neighbor.get('name') for neighbor in country.findall('neighbor')]
    data.append((rank, name, year, gdppc, neighbors))
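
Note that find() returns None when a child element is absent, so country.find('gdppc').text raises AttributeError for an incomplete record. One possible way to tolerate gaps is findtext(), which returns None instead of failing. The sketch below assumes a made-up second record that lacks a gdppc element; it is not part of the sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the second record has no <gdppc> child.
root = ET.fromstring("""<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
    </country>
    <country name="Nowhere">
        <rank>99</rank>
        <year>2011</year>
    </country>
</data>""")

data = []
for country in root.findall("country"):
    data.append((
        country.findtext("rank"),
        country.get("name"),
        country.findtext("year"),
        country.findtext("gdppc"),  # None when <gdppc> is missing
    ))

print(data)
```

A missing value stored as None shows up as a missing entry once the rows are loaded into a DataFrame, which is usually easier to handle than an exception in the middle of the loop.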

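If the country elements may sit at different depths of the tree, we can use iter() instead, which finds them wherever they occur. The loop below visits every country, not only the first one:

data = []
for child in root.iter('country'):
    rank = child.find('rank').text
    name = child.get('name')
    year = child.find('year').text
    gdppc = child.find('gdppc').text
    neighbors = [neighbor.get('name') for neighbor in child.findall('neighbor')]
    # Store the row; otherwise each pass overwrites the previous values.
    data.append((rank, name, year, gdppc, neighbors))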
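
To make the difference between the two methods concrete, here is a small sketch on a hypothetical layout in which one country is wrapped in a region element. The region element does not appear in the sample data and is assumed purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: one <country> directly under the root,
# another nested inside a <region> element.
root = ET.fromstring("""<data>
    <country name="Panama"/>
    <region name="Asia">
        <country name="Singapore"/>
    </region>
</data>""")

# findall('country') looks at direct children of root only.
direct = [c.get("name") for c in root.findall("country")]

# iter('country') walks the whole subtree in document order.
anywhere = [c.get("name") for c in root.iter("country")]

# findall() with a './/' path prefix also searches all descendants.
via_path = [c.get("name") for c in root.findall(".//country")]

print(direct, anywhere, via_path)
```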

Convert the data into a DataFrame

df = pd.DataFrame(data, columns=['Rank', 'Name', 'Year', 'GDP Per Capita', 'Neighbors'])
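
Everything read through .text is a string, so the Rank, Year, and GDP Per Capita columns start out with the object dtype. If numeric operations are needed, one option is to convert those columns with pd.to_numeric(). The sketch below hard-codes the rows the extraction loop would produce from the sample data so that it runs on its own.

```python
import pandas as pd

# Rows as produced by the extraction loop: values read via .text are strings.
data = [
    ("1", "Liechtenstein", "2008", "141100", ["Austria", "Switzerland"]),
    ("4", "Singapore", "2011", "59900", ["Malaysia"]),
    ("68", "Panama", "2011", "13600", ["Costa Rica", "Colombia"]),
]
df = pd.DataFrame(data, columns=["Rank", "Name", "Year", "GDP Per Capita", "Neighbors"])

# Convert the string columns that hold numbers into numeric dtypes.
for col in ["Rank", "Year", "GDP Per Capita"]:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```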

The complete code snippet is as follows:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('file.xml')
root = tree.getroot()

data = []
for country in root.findall('country'):
    rank = country.find('rank').text
    name = country.get('name')
    year = country.find('year').text
    gdppc = country.find('gdppc').text
    neighbors = [neighbor.get('name') for neighbor in country.findall('neighbor')]
    data.append((rank, name, year, gdppc, neighbors))

df = pd.DataFrame(data, columns=['Rank', 'Name', 'Year', 'GDP Per Capita', 'Neighbors'])
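
The code above stores each country's neighbors as a Python list inside a single cell. If a flat table is preferred, an alternative is to emit one row per country and neighbor pair, which also makes room for the direction attribute. This is a sketch of that variation, not part of the original approach; the column names Neighbor and Direction and the variable long_df are assumptions, and the input is a shortened copy of the sample data.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Shortened copy of the sample data.
root = ET.fromstring("""<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>""")

rows = []
for country in root.findall("country"):
    # One output row per <neighbor>, repeating the parent's fields.
    for neighbor in country.findall("neighbor"):
        rows.append({
            "Name": country.get("name"),
            "Rank": country.findtext("rank"),
            "Neighbor": neighbor.get("name"),
            "Direction": neighbor.get("direction"),
        })

long_df = pd.DataFrame(rows)
print(long_df)
```

With this layout a country that has no neighbor elements produces no rows at all, so it suits data where every parent has at least one child or where such parents can safely be dropped.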
